[00:34:13] 10Tool-containers, 10Toolforge: Provide a Redis container for use within a tool's namespace - https://phabricator.wikimedia.org/T360378#9924563 (10bd808) 05In progress→03Resolved I have added https://wikitech.wikimedia.org/wiki/Help:Toolforge/Redis_for_Toolforge#Redis_containers to make the new contain... [00:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:59:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [01:02:37] !log dcaro@urcuchillay redirects END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) [01:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Redirects/SAL [02:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [06:20:57] 06cloud-services-team, 10Data-Services, 06SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9924911 (10Marostegui) So in terms 
of data, my recap is: - root password is different from production - the data that is present there is sanitized... [06:29:45] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb1017 (s1) - https://phabricator.wikimedia.org/T367778#9924916 (10Marostegui) Just for the record, I have been investigating the current lag on clouddb1019:3314 - it is because of this: ` root@clouddb... [06:33:09] 10cloud-services-team (FY2023/2024-Q3-Q4), 10superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#9924933 (10KCVelaga_WMF) @fnegri I only plan to have final aggregated tables, so it should be much less than 25 GB limit. I will create the test db a... [07:02:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-29 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [07:12:46] (03approved) 10dcaro: Replace Python 3.9 type aliases with 3.7-compatible aliases [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/46 (https://phabricator.wikimedia.org/T368463) (owner: 10anticomposite) [07:15:51] 10Cloud-VPS (Debian Buster Deprecation), 10WMIT-Infrastructure: Cloud VPS "osmit" project Buster deprecation - https://phabricator.wikimedia.org/T367543#9925002 (10LorenzoStucchi) VM [[ https://openstack-browser.toolforge.org/server/osmit-uno.osmit.eqiad1.wikimedia.cloud | osmit-uno.osmit.eqiad1.wikimedia.clou... 
[07:22:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-29 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [07:29:53] (03merge) 10aborrero: Replace Python 3.9 type aliases with 3.7-compatible aliases [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/46 (https://phabricator.wikimedia.org/T368463) (owner: 10anticomposite) [07:31:40] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925038 (10dcaro) >>! In T348643#9921767, @CDanis wrote: > Ha, I had also made a silly little dashboard yesterday bu... 
[07:35:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-29 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [07:44:41] 10Cloud-VPS, 10Tool-spacemedia: DNS name resolution failure with cdn.esahubble.org from Cloud VPS & Toolforge - https://phabricator.wikimedia.org/T368439#9925051 (10taavi) [07:50:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-29 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [07:54:48] 10Cloud-VPS, 10Tool-spacemedia: DNS name resolution failure with cdn.esahubble.org from Cloud VPS & Toolforge - https://phabricator.wikimedia.org/T368439#9925072 (10taavi) Seemingly this works now: `lang=shell-session taavi@tools-bastion-12:~ $ dig cdn.esahubble.org ; <<>> DiG 9.18.24-1-Debian <<>> cdn.esahub... [08:01:25] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-29 [08:06:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-29 [08:12:10] 10Data-Services, 06DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9925120 (10ABran-WMF) All done, ready for the views creation. 
[08:22:39] 10Cloud-VPS (Debian Buster Deprecation), 06Editing-team: Cloud VPS "visualeditor" project Buster deprecation - https://phabricator.wikimedia.org/T367559#9925151 (10Esanders) Deleted [08:29:39] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:31:23] 10Cloud-VPS (Debian Buster Deprecation), 06Editing-team: Cloud VPS "visualeditor" project Buster deprecation - https://phabricator.wikimedia.org/T367559#9925186 (10Jdforrester-WMF) 05Open→03Resolved a:03Esanders [08:31:58] (03open) 10taavi: Add dark mode support [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/9 [08:34:39] RESOLVED: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:37:43] (03open) 10taavi: Run as a build service tool [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/10 [08:38:09] 06cloud-services-team, 10Toolforge: toolforge: maintain-kubeusers crashes if LDAP server terminates session - https://phabricator.wikimedia.org/T368512 (10aborrero) 03NEW [08:40:42] (03update) 10taavi: Run as a build service tool [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/10 [08:40:44] 06cloud-services-team, 10Toolforge: toolforge: maintain-kubeusers crashes if LDAP server terminates session - https://phabricator.wikimedia.org/T368512#9925271 (10taavi) Duplicate of {T352011}? 
[08:40:56] (03approved) 10dcaro: Add dark mode support [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/9 (owner: 10taavi) [08:41:11] (03merge) 10taavi: Add dark mode support [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/9 [08:41:33] (03update) 10taavi: Run as a build service tool [toolforge-repos/fourohfour] - 10https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/10 [08:49:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [08:51:31] (03update) 10aborrero: kyverno_pod_policy: set validation to Enforce [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46 (https://phabricator.wikimedia.org/T368141) [09:05:53] !log taavi@cloudcumin1001 cloudinfra-nfs START - Cookbook wmcs.openstack.migrate_project_to_ovs [09:06:55] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: toolforge: kyverno: change policies to Enforce - https://phabricator.wikimedia.org/T368141#9925377 (10aborrero) Before setting policies to Enforce, I've checked again the policy reports. There are a bunch of policy violations: `lang=shell-session abor... [09:07:21] !log taavi@cloudcumin1001 cloudinfra-nfs END (PASS) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=0) [09:07:22] (03approved) 10dcaro: kyverno_pod_policy: set validation to Enforce [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46 (https://phabricator.wikimedia.org/T368141) (owner: 10aborrero) [09:08:41] thanks! 
[09:08:43] (03merge) 10aborrero: kyverno_pod_policy: set validation to Enforce [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46 (https://phabricator.wikimedia.org/T368141) [09:11:15] 10Toolforge: Rust buildservice failed to clone a repository from GitHub - https://phabricator.wikimedia.org/T362404#9925409 (10Tgr) [09:11:22] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.155-20240626090852-f6b198f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/351 (https://phabricator.wikimedia.org/T368141) [09:11:43] 06cloud-services-team, 10Toolforge: toolforge: kyverno: enable monitoring - https://phabricator.wikimedia.org/T368515 (10aborrero) 03NEW [09:15:17] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [09:15:27] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [09:17:25] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [09:17:36] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [09:20:50] (03merge) 10aborrero: maintain-kubeusers: bump to 0.0.155-20240626090852-f6b198f9 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/351 (https://phabricator.wikimedia.org/T368141) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [09:23:50] (03open) 10dcaro: Revert "envvars-api: bump to 0.0.50-20240619035607-42829b67" [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352 
(https://phabricator.wikimedia.org/T368516) [09:24:20] 10Toolforge (Toolforge iteration 11), 13Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516 (10dcaro) 03NEW [09:24:49] 06cloud-services-team, 10Continuous-Integration-Infrastructure, 10MediaWiki-Vagrant, 07Composer, 07Upstream: Composer activity from Cloud VPS hosts can be rate limited by GitHub - https://phabricator.wikimedia.org/T106452#9925469 (10hashar) 05Open→03Resolved a:03bd808 This was solved for #conti... [09:25:30] (03open) 10aborrero: kyverno_pod_policy: don't autogenerate validation rules [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/51 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) [09:25:35] 10Toolforge (Toolforge iteration 11): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#9925474 (10taavi) 05Resolved→03Open 0.0.50 (or a later version) still needs to be deployed to `tools`. 
[09:25:47] (03approved) 10sstefanova: Revert "envvars-api: bump to 0.0.50-20240619035607-42829b67" [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352 (https://phabricator.wikimedia.org/T368516) (owner: 10dcaro) [09:25:54] 10Toolforge: Rust buildservice failed to clone a repository from GitHub - https://phabricator.wikimedia.org/T362404#9925485 (10hashar) →14Duplicate dup:03T362095 [09:25:59] 10Toolforge (Toolforge iteration 11): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#9925490 (10taavi) [09:26:02] 10Toolforge (Toolforge iteration 11), 13Patch-For-Review: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516#9925491 (10taavi) [09:26:54] 10Toolforge (Toolforge iteration 11): envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961#9925493 (10taavi) >>! In T367961#9925474, @taavi wrote: > 0.0.50 (or a later version) still needs to be deployed to `tools`. .. and {T368516} needs fixing before we can... [09:33:38] 10Cloud-VPS, 10Tool-spacemedia: DNS name resolution failure with cdn.esahubble.org from Cloud VPS & Toolforge - https://phabricator.wikimedia.org/T368439#9925548 (10Don-vip) I confirm, this morning it works. It was not working for a few days. Feel free to close the ticket if nothing can be done to understand w... 
[09:34:21] FIRING: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.66M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang [09:35:16] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-3 is lagging behind the primary, the current lag is 403672 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [09:36:51] !log taavi@cloudcumin1001 spacemedia START - Cookbook wmcs.openstack.quota_increase (T368464) [09:36:54] T368464: Request quota increase for spacemedia project - https://phabricator.wikimedia.org/T368464 [09:36:59] !log taavi@cloudcumin1001 spacemedia END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T368464) [09:38:02] (03update) 10aborrero: kyverno_pod_policy: don't autogenerate validation rules [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/51 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) [09:39:28] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Tool-spacemedia: Request quota increase for spacemedia project - https://phabricator.wikimedia.org/T368464#9925560 (10taavi) 05Open→03Resolved a:03taavi [09:41:51] (03open) 10aborrero: d/changelog: bump to 0.103.9 [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/47 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368463) [09:43:52] (03merge) 10aborrero: d/changelog: bump to 0.103.9 [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/47 (https://phabricator.wikimedia.org/T362050 
https://phabricator.wikimedia.org/T368463) [09:45:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance runner-1024 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:50:29] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:53:32] 10Data-Services, 06Data-Persistence, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Upgrade clouddb1021 to bookworm - https://phabricator.wikimedia.org/T365450#9925630 (10BTullis) 05Open→03Declined We are about to bring an-redacteddb1021 into service, so I will decline this ticket in favour of a deco... [09:55:29] FIRING: [4x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:55:34] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9925643 (10dcaro) [09:59:52] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789) [09:59:54] 06cloud-services-team, 10Data-Services, 06SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9925649 (10fnegri) > there's some data data there that we filter via the views and not only via sanitarium, but I guess that's fine Do you know what... 
[09:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [09:59:58] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [10:00:01] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T309789) [10:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:00:28] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789) [10:00:29] FIRING: [5x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:04:35] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9925684 (10fnegri) 05Resolved→03Open [10:05:29] FIRING: [7x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:05:54] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9925681 (10fnegri) [10:08:06] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.66M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang [10:10:29] FIRING: [8x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:13:37] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - 
https://phabricator.wikimedia.org/T367778#9925706 (10fnegri) > That query is pretty expensive and it is basically scanning 146M rows, which is unlikely it'll be finish before the 10800 mark. A... [10:15:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:18:23] 10Cloud-VPS: cloud vps: fix flavor g3.cores16.ram32.disk20 id 37ed9aaa-35b2-4141-8bc4-272ec8bbc303 - https://phabricator.wikimedia.org/T337010#9925712 (10taavi) 05Stalled→03Resolved Closing since this is being resolved with the new g4 flavors. [10:18:50] (03approved) 10dcaro: kyverno_pod_policy: don't autogenerate validation rules [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/51 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) (owner: 10aborrero) [10:23:14] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9925735 (10fnegri) 05Open→03In progress [10:24:55] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925737 (10dcaro) There's definitely some load coming in: {F55892394} Though no spikes on the latencies so far: {F... 
[10:26:55] (03merge) 10aborrero: kyverno_pod_policy: don't autogenerate validation rules [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/51 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) [10:29:12] 10Data-Services, 06Data-Persistence, 10Data-Platform-SRE (2024.06.17 - 2024.07.07), 13Patch-For-Review: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts - https://phabricator.wikimedia.org/T368354#9925785 (10BTullis) p:05Triage→03Medium Thanks both. I have created a merge request here:... [10:30:47] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9925788 (10fnegri) [10:31:49] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9925800 (10fnegri) The lag that started this morning in clouddb1015 (mysql.s4) is even harder to explain, as that is a "web" host, with wmf-pt-kill runn... [10:37:00] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925808 (10dcaro) changed the graphs to use rate of the stat, instead of the raw counter value, now there's some inf... [10:39:17] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925810 (10dcaro) The host with the old non-error-reporting drives has a similar shape (just a bit higher latency):... 
[10:39:30] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925811 (10dcaro) read has even less difference, and flush only happens for the os-dedicated drives. [10:39:32] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.156-20240626103707-3aa9727d [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/353 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) [10:39:50] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [10:40:01] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [10:43:31] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [10:43:42] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [10:44:26] (03merge) 10aborrero: maintain-kubeusers: bump to 0.0.156-20240626103707-3aa9727d [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/353 (https://phabricator.wikimedia.org/T362050 https://phabricator.wikimedia.org/T368141) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [10:47:29] 10Cloud-VPS (Debian Buster Deprecation), 10Beta-Cluster-Infrastructure: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm - https://phabricator.wikimedia.org/T327742#9925836 (10hnowlan) [10:51:36] (03update) 10aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142) [11:00:21] FIRING: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.66M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang [11:09:15] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: toolforge: kyverno: change policies to Enforce - https://phabricator.wikimedia.org/T368141#9925914 (10aborrero) 05In progress→03Resolved [11:09:33] 06cloud-services-team, 10Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9925920 (10aborrero) [11:09:56] (03update) 10aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142) [11:10:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.66M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang [11:13:29] (03update) 10aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142) [11:16:16] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9925952 (10aborrero) scheduled for tomorrow 2024-06-26 [11:16:29] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T309789) [11:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [11:16:37] T309789: [ceph] Upgrade hosts to bullseye - 
https://phabricator.wikimedia.org/T309789 [11:16:53] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9925957 (10aborrero) [11:24:32] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9925999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin10... [11:37:44] 10Toolforge (Toolforge iteration 11): New upstream release for Pywikibot - https://phabricator.wikimedia.org/T363631#9926063 (10taavi) a:03taavi [11:42:03] 10Toolforge (Toolforge iteration 11): New upstream release for Pywikibot - https://phabricator.wikimedia.org/T363631#9926070 (10taavi) 05Open→03Resolved [11:51:58] 06cloud-services-team, 10Toolforge: toolforge: maintain-kubeusers crashes if LDAP server terminates session - https://phabricator.wikimedia.org/T368512#9926095 (10aborrero) →14Duplicate dup:03T352011 [11:52:03] 10Toolforge: maintain-kubeusers occasionally crashes to a LDAP connection error - https://phabricator.wikimedia.org/T352011#9926097 (10aborrero) [11:55:05] (03approved) 10dcaro: Revert "envvars-api: bump to 0.0.50-20240619035607-42829b67" [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352 (https://phabricator.wikimedia.org/T368516) [11:55:10] (03update) 10dcaro: Revert "envvars-api: bump to 0.0.50-20240619035607-42829b67" [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352 (https://phabricator.wikimedia.org/T368516) [11:55:35] (03merge) 10dcaro: Revert "envvars-api: bump to 0.0.50-20240619035607-42829b67" [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352 
(https://phabricator.wikimedia.org/T368516) [11:56:51] (03open) 10dcaro: envvars-api: bump to 0.0.50-20240619035607-42829b67 again [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/354 [11:57:20] (03update) 10dcaro: envvars-api: bump to 0.0.50-20240619035607-42829b67 again [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/354 (https://phabricator.wikimedia.org/T368516) [12:10:06] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21 - https://phabricator.wikimedia.org/T368250#9926144 (10fnegri) After 5 days, replication is still stuck on the same transaction. I will try to apply the transaction manually, with the followin... [12:23:04] (03open) 10aborrero: deployment: use Recreate pod replacement strategy [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/52 [12:27:10] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21 - https://phabricator.wikimedia.org/T368250#9926194 (10fnegri) Both `STOP SLAVE` and `systemctl stop mariadb` were hanging, I had to use `kill -9` :/ The manual UPDATE query is currently runn... 
[12:28:32] FIRING: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [12:30:10] (03update) 10aborrero: deployment: use Recreate pod replacement strategy [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/52 [12:31:01] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9926206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 f... [12:49:56] FIRING: CloudVPSDesignateLeaks: Detected 12 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [12:50:07] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21 - https://phabricator.wikimedia.org/T368250#9926265 (10fnegri) My theory was apparently correct, because the UPDATE query (`UPDATE phash_dhash SET dhashv=NULL;`) completed in just 11 minutes.... 
[12:55:48] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services, 06DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9926272 (10fnegri) a:03fnegri I will run the `sre.wikireplicas.add-wiki` cookbook [12:56:42] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services, 06DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9926275 (10fnegri) 05Open→03In progress p:05Triage→03High [12:58:09] 06cloud-services-team, 10Cloud-VPS: Migrate codfw1dev hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T368426#9926284 (10taavi) a:05taavi→03Andrew [13:02:14] 10Cloud-VPS (Debian Buster Deprecation), 06collaboration-services: Cloud VPS "packaging" project Buster deprecation - https://phabricator.wikimedia.org/T367544#9926316 (10Jelto) I created the bookworm host `packager-etherpad01.packaging.eqiad1.wikimedia.cloud` to replace `packager02.packaging.eqiad1.wikimedia.... [13:23:27] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9926375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin10... [13:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 13 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:35:27] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services, 06DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9926421 (10fnegri) The cookbook completed with PASS, but there were some errors in the DNS creation: ` 2024-06-26T13:19:11Z root ERROR : Zone a... 
[13:48:54] 06cloud-services-team, 10decommission-hardware: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536 (10Andrew) 03NEW [13:50:44] 06cloud-services-team, 10Cloud-VPS: Migrate codfw1dev hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T368426#9926491 (10Andrew) [13:53:08] 06cloud-services-team, 10Cloud-VPS, 10WikiCite: cloud-vps Trove instance 'wikicitations' shows host 'none' - https://phabricator.wikimedia.org/T368232#9926513 (10Andrew) 05Open→03Resolved a:03Andrew Thanks! I've deleted the instance. [14:01:04] 10Data-Services: [wikireplicas] wmcs-wikireplica-dns.py creates DNS records for private wikis - https://phabricator.wikimedia.org/T368538 (10fnegri) 03NEW [14:01:20] 10Data-Services: [wikireplicas] wmcs-wikireplica-dns.py creates DNS records for private wikis - https://phabricator.wikimedia.org/T368538#9926566 (10fnegri) p:05Triage→03Low [14:03:03] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services, 06DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9926578 (10fnegri) The second run of the cookbook completed successfully. Anything left to do in this task? 
[14:21:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:26:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:42:04] 10Data-Services, 10VPS-Projects: Request access to NFS mount /public/dumps for research-collaborations-api Cloud VPS project - https://phabricator.wikimedia.org/T368432#9926725 (10JJMC89) 05Resolved→03Invalid [14:43:23] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9926724 (10cmooney) >>! In T364870#9865334, @wiki_willy wrote: > Hi @dcaro - just following up on this. Can you provide the racking information for us, t... 
[14:45:37] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [14:45:40] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:45:43] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [14:45:46] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [14:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:45:59] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [14:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [14:50:51] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS: Migrate WMCS managed projects to g4 flavors - https://phabricator.wikimedia.org/T367723#9926767 (10taavi) [14:52:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:58:53] 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9926847 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudvirt2001-dev.codfw.wmnet` - cloudvirt2... 
[15:08:05] 06cloud-services-team, 10decommission-hardware, 13Patch-For-Review: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9926870 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudvirt2002-dev.codfw.wmnet` - cloudvirt2... [15:09:31] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:09:38] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [15:15:58] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9926894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin10... [15:18:00] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [15:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:18:05] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [15:29:57] 06cloud-services-team, 10Cloud-VPS: Migrate codfw1dev hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T368426#9926949 (10Andrew) 05Open→03Resolved [15:46:48] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:46:54] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [15:54:16] 06cloud-services-team, 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#9927154 (10dcaro) [15:54:19] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install 
new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9927160 (10dcaro) >>! In T364870#9926724, @cmooney wrote: >>>! In T364870#9865334, @wiki_willy wrote: >> Hi @dcaro - just following up on this. Can you p... [16:11:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [16:13:11] (03PS9) 10David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990977 [16:13:11] (03CR) 10David Caro: ceph.osd.drain_node: force passing the cluster name (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990977 (owner: 10David Caro) [16:13:12] (03PS9) 10David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990978 [16:13:12] (03PS10) 10David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990979 [16:13:14] (03PS13) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709) [16:15:21] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9927259 (10fnegri) Both clouddb1019 and clouddb1015 are still lagging behind. The query mentioned above by @Marostegui can possibly explain the lag in c... 
[16:16:20] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709) (owner: 10David Caro) [16:24:26] (03PS1) 10JHathaway: postfix: mx domain aliases [labs/private] - 10https://gerrit.wikimedia.org/r/1049987 [16:26:54] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9927314 (10VRiley-WMF) 05Open→03In progress I am proceeding with moving the server physically. I will update this ticket once it's completed and updated... [16:30:03] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:47] FIRING: NodeDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [16:33:50] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [16:33:52] 06cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T368559 (10phaultfinder) 03NEW [16:39:35] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:39:47] (03CR) 10JHathaway: [C:03+2] postfix: mx domain aliases [labs/private] - 10https://gerrit.wikimedia.org/r/1049987 (owner: 10JHathaway) [16:39:48] (03CR) 10JHathaway: [V:03+2 C:03+2] postfix: mx domain aliases [labs/private] - 10https://gerrit.wikimedia.org/r/1049987 (owner: 10JHathaway) [16:43:23] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: reapply thermal paste to processors in cloudvirt1063 - 
https://phabricator.wikimedia.org/T368093#9927398 (10VRiley-WMF) 05In progress→03Open The server has been physically moved from U 42 to 33. No other changes happened (such as CableID) also, power... [16:49:17] RESOLVED: NodeDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [17:10:19] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:41:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:44:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [18:03:21] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Tool-spacemedia: Request quota increase for spacemedia project - https://phabricator.wikimedia.org/T368464#9927778 (10Don-vip) Thank you! 
[18:41:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:18:14] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [19:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [19:18:22] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [19:20:56] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9928109 (10Marostegui) >>! In T367778#9925706, @fnegri wrote: >> That query is pretty expensive and it is basically scanning 146M rows, which is unlikel... [20:31:48] 14MediaWiki-extensions-OpenStackManager, 06Diffusion-Repository-Administrators, 10Projects-Cleanup, 06translatewiki.net, 10Wikimedia-GitHub: Archive the OpenStackManager extension - https://phabricator.wikimedia.org/T367220#9928417 (10Pppery) 05Open→03Resolved [20:37:20] 10Toolforge, 10Phabricator, 10GitLab (Auth & Access): Look for ways to consolidate "we trust this human" access lists - https://phabricator.wikimedia.org/T364516#9928449 (10brennen) [21:05:41] 10Cloud-VPS (Debian Buster Deprecation), 06Infrastructure-Foundations, 06Release-Engineering-Team: Cloud VPS "integration" project Buster deprecation - https://phabricator.wikimedia.org/T367534#9928595 (10hashar) [21:07:08] 14Cloud-VPS (Debian Stretch Deprecation), 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071#9928598 (10hashar) [21:09:08] 
10Cloud-VPS (Debian Buster Deprecation), 06Infrastructure-Foundations, 06Release-Engineering-Team: Cloud VPS "integration" project Buster deprecation - https://phabricator.wikimedia.org/T367534#9928606 (10hashar) [21:11:07] 10Cloud-VPS (Debian Buster Deprecation), 06Infrastructure-Foundations, 06Release-Engineering-Team: Cloud VPS "integration" project Buster deprecation - https://phabricator.wikimedia.org/T367534#9928610 (10hashar) [21:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:50:42] 06cloud-services-team, 10Data-Services, 06SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9928727 (10bd808) >>! In T368136#9925649, @fnegri wrote: >> there's some data data there that we filter via the views and not only via sanitarium, bu... [21:57:57] 06cloud-services-team, 10Data-Services, 06SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9928742 (10bd808) >>! In T368136#9924910, @Marostegui wrote: > So in terms of data, my recap is: > - non-public data (such as suppressed edits or ban... [22:18:44] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9928813 (10Andrew) 05Open→03Resolved thank you! I've put this back in service; we'll see if it cooks again. 
[22:48:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [23:07:51] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation): Replace all codfw1dev Buster VMs - https://phabricator.wikimedia.org/T368341#9928927 (10Andrew) all remaining buster VMs are now shutoff and can be deleted in a few days. [23:13:08] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "redirects" project Buster deprecation - https://phabricator.wikimedia.org/T367550#9928937 (10bd808) >>! In T367550#9894652, @Dzahn wrote: > So looks like there is some networking / firewall thing. But I see nothing that wouldn't match in the security group ("we... [23:22:31] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "redirects" project Buster deprecation - https://phabricator.wikimedia.org/T367550#9928944 (10Dzahn) >>! In T367550#9928937, @bd808 wrote: > I think the Horizon UI tricked you into thinking that the "web" security group had been applied to the new instance when... 
[23:24:30] !log andrew@cloudcumin1001 bastion START - Cookbook wmcs.openstack.migrate_server_to_ovs for server bastion-eqiad1-03 [23:24:32] !log andrew@cloudcumin1001 bastion END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server bastion-eqiad1-03 [23:25:12] !log andrew@cloudcumin1001 bastion START - Cookbook wmcs.openstack.migrate_server_to_ovs for server bastion-eqiad1-03 [23:25:52] !log andrew@cloudcumin1001 bastion END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server bastion-eqiad1-03 [23:27:13] !log andrew@cloudcumin1001 bastion START - Cookbook wmcs.openstack.migrate_server_to_ovs for server bastion-eqiad1-04 [23:28:13] !log andrew@cloudcumin1001 bastion END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server bastion-eqiad1-04 [23:28:27] !log andrew@cloudcumin1001 bastion START - Cookbook wmcs.openstack.migrate_server_to_ovs for server bastion-eqiad1-04 [23:28:30] !log andrew@cloudcumin1001 bastion END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server bastion-eqiad1-04 [23:29:47] 10cloud-services-team (Hardware): NodeDown (cloudvirt1063) - https://phabricator.wikimedia.org/T368007#9928957 (10Andrew) 05Open→03Resolved [23:30:33] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS: Migrate WMCS managed projects to g4 flavors - https://phabricator.wikimedia.org/T367723#9928959 (10Andrew) [23:42:58] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "redirects" project Buster deprecation - https://phabricator.wikimedia.org/T367550#9928962 (10bd808) `lines=10 $ ssh redirects-nginx03.redirects.eqiad1.wikimedia.cloud $ for h in $(awk '/server_name/ && $2 !~ /_/ {gsub(/;/,"",$2); print $2}' /etc/nginx/sites-ena... 
[23:50:17] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "redirects" project Buster deprecation - https://phabricator.wikimedia.org/T367550#9928970 (10bd808) 05In progress→03Resolved [23:53:42] 10Cloud-VPS (Debian Buster Deprecation): Cloud VPS "toolhub" project Buster deprecation - https://phabricator.wikimedia.org/T367556#9928975 (10bd808) 05Open→03In progress