[00:12:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:17:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:54:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:22:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:09:41] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:43:11] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:23:05] (PS1) Samwilson: Upgrade dependencies etc. [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/972082
[04:25:09] (CR) Samwilson: [C: +2] Upgrade dependencies etc. [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/972082 (owner: Samwilson)
[04:25:38] (CR) Samwilson: "The failure here is related to Symfony Flex, which I'm removing in Id182f90c67a44a337289a296e2105ab50cbe2d3e" [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/971743 (owner: VolkerE)
[04:25:55] (Merged) jenkins-bot: Upgrade dependencies etc. [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/972082 (owner: Samwilson)
[04:56:31] (PS3) Samwilson: build, styles: Replace WikimediaUI Base with Codex design tokens [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/971743 (owner: VolkerE)
[05:10:58] (CR) Samwilson: [C: +2] build, styles: Replace WikimediaUI Base with Codex design tokens [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/971743 (owner: VolkerE)
[05:11:42] (Merged) jenkins-bot: build, styles: Replace WikimediaUI Base with Codex design tokens [labs/tools/meetingtimes] - https://gerrit.wikimedia.org/r/971743 (owner: VolkerE)
[05:49:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:22:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:36:26] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (MusikAnimal) Open→Resolved Also fixed for the original report https://pageviews.wmcloud.org/mediaviews/?pro...
[07:02:21] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:13:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:22:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:43:11] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:17:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:42:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:49:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:02:55] Toolforge: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi)
[10:02:59] Toolforge: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi)
[10:03:01] Toolforge: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi)
[10:03:03] Toolforge: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi)
[10:03:15] Toolforge (Toolforge iteration 02), cloud-services-team: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi)
[10:08:14] Toolforge, cloud-services-team: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi) a:taavi
[10:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:43:56] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/5 Make...
[10:49:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:14:41] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:43:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:43:34] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (BTullis) Open→Resolved I've run the cookbook again and the DNS step has now completed, so it must have been a transient failure. Resolving this ticket.
[11:44:17] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) a:BTullis
[11:44:27] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) a:BTullis
[11:44:36] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) a:BTullis
[11:47:12] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (BTullis) Open→Resolved This is now working, including the DNS alias. ` btullis@tools-sgebastion-10:~$ sql bbcwiki Reading table information for completion of table a...
[11:48:21] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql bjnwikiquote Reading table information for completion of table and column name...
[11:49:46] Data-Services, DBA, Data-Platform-SRE: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (BTullis) Open→Resolved This is now complete. ` btullis@tools-sgebastion-10:~$ sql dgawiki Reading table information for completion of table and column names You can...
[13:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:25:54] VPS-Projects, WMDE-TechWish-Maintenance-2023: Scraper: destroy Cloud VPS runner instance - https://phabricator.wikimedia.org/T345411 (awight) Done. We still have the raw reports stored on an 80GiB detached volume: https://horizon.wikimedia.org/project/volumes/ec65493b-e06c-446a-8aa8-7e4df54ee7fd/
[13:56:04] Toolforge Build Service (Beta release), cloud-services-team (FY2022/2023-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [buildservice-api] Create a build POST endpoint to start a new build - https://phabricator.wikimedia.org/T337218 (Raymond_Ndibe)
[13:56:08] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] catch harbor timeout when creating repository - https://phabricator.wikimedia.org/T345903 (Raymond_Ndibe) Resolved→Open
[13:56:12] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] catch harbor timeout when creating repository - https://phabricator.wikimedia.org/T345903 (Raymond_Ndibe) In progress→Resolved
[13:56:15] Toolforge Build Service (Beta release), cloud-services-team (FY2022/2023-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [buildservice-api] Create a build POST endpoint to start a new build - https://phabricator.wikimedia.org/T337218 (Raymond_Ndibe)
[13:56:23] Toolforge (Toolforge iteration 02): [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (Raymond_Ndibe) Resolved→Open
[13:56:58] Toolforge (Toolforge iteration 02): [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (Raymond_Ndibe) In progress→Resolved
[14:13:27] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi) Open→In progress
[14:13:31] Toolforge, cloud-services-team: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi)
[14:18:37] (CephSlowOps) firing: Ceph cluster in eqiad has 92 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:18:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[14:20:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:23:19] (CR) Btullis: [C: +1] "Feel free to self-merge without a +1 on the labs/private repo, unless you're specifically requesting a review for any reason." [labs/private] - https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[14:23:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 50 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:25:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:27:38] Toolforge (Toolforge iteration 02), cloud-services-team: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi)
[14:28:33] Toolforge (Toolforge iteration 02), cloud-services-team: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi) Open→In progress
[14:31:50] Toolforge: Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687 (taavi)
[14:38:49] Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Slst2020) > As per documentation the .m2 folder (local maven repository) should be cached between builds, so the new build shouldn't spend...
[14:42:27] Toolforge: [tbs] Explore adding caching support - https://phabricator.wikimedia.org/T350689 (Slst2020)
[14:42:46] Toolforge: [tbs] Explore adding caching support - https://phabricator.wikimedia.org/T350689 (Slst2020)
[14:42:48] Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Slst2020)
[14:44:01] Toolforge Build Service (Beta release): [buildservice] Cache .m2 folder (local maven repository) between builds - https://phabricator.wikimedia.org/T350307 (Slst2020)
[14:44:24] Toolforge Build Service (Beta release): [buildservice] Cache .m2 folder (local maven repository) between builds - https://phabricator.wikimedia.org/T350307 (Slst2020)
[14:46:58] Toolforge: find an alternative to Vagrant - https://phabricator.wikimedia.org/T348960 (Slst2020)
[14:47:15] Toolforge: [tbs][dev] find an alternative to Vagrant - https://phabricator.wikimedia.org/T348960 (Slst2020)
[14:48:19] Cloud Services Proposals, Toolforge (Toolforge iteration 02): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (Slst2020) In progress→Stalled
[14:48:36] Toolforge (Toolforge iteration 02): Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (Slst2020) Open→Resolved
[14:48:59] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] catch harbor timeout when creating repository - https://phabricator.wikimedia.org/T345903 (Slst2020) Open→Resolved
[14:49:05] Toolforge Build Service (Beta release), cloud-services-team (FY2022/2023-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [buildservice-api] Create a build POST endpoint to start a new build - https://phabricator.wikimedia.org/T337218 (Slst2020)
[14:49:27] Toolforge (Toolforge iteration 02): [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (Slst2020) Open→Resolved
[14:49:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:50:25] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020) Open→In progress
[14:51:58] Toolforge (Toolforge iteration 02): [tbs][dev] decide on which kubernetes bootstrapper to focus on between minikube and kind - https://phabricator.wikimedia.org/T347723 (Slst2020)
[14:53:13] Toolforge: [tbs][dev] decide on which kubernetes bootstrapper to focus on between minikube and kind - https://phabricator.wikimedia.org/T347723 (Slst2020)
[14:54:03] Toolforge (Toolforge iteration 02): Publish a blog post about buildservice on the Tech Blog - https://phabricator.wikimedia.org/T350691 (Slst2020)
[14:55:08] Toolforge (Toolforge iteration 02), Documentation: Create an ASGI tutorial for buildservice - https://phabricator.wikimedia.org/T350692 (Slst2020)
[15:13:39] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bookworm
[15:16:19] (HAProxyBackendUnavailable) firing: HAProxy service designate-api_backend backend cloudservices1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:18:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:37:05] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (Andrew) a:Andrew Hello @jsn.sherman. There seem to be a few races/weak points in the code that manages volume attachment so sometimes things get stuck in an incons...
[15:40:26] cloud-services-team, Infrastructure-Foundations, Puppet CI: puppet catalog compiler (pcc) failing with internal error - https://phabricator.wikimedia.org/T347358 (jbond) Adding a bit of background currently the pcc-workers talk to puppetdb via an nginx instance local to the host which proxies conncet...
[15:43:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:45:44] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) >>! In T350586#9312988, @Andrew wrote: > Hello @jsn.sherman. There seem to be a few races/weak points in the code that manages volume attachment so someti...
[15:51:27] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (Andrew) What about the 'wikilink-backup' volume? Leave as-is?
[16:02:39] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) >>! In T350586#9313075, @Andrew wrote: > What about the 'wikilink-backup' volume? Leave as-is? yep, I was able to make changes to `wikilink-backup` last...
[16:04:15] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (Andrew) I deleted wikilink-nfs. I detached and expanded docker-data-root. I did not reattach it because there are few fun bits left for you. - Attaching. You can do...
[16:05:18] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (Andrew) a:Andrew→jsn.sherman
[16:08:20] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) >>! In T350586#9313136, @Andrew wrote: > I deleted wikilink-nfs. > > I detached and expanded docker-data-root. Thanks! > I did not reattach it because the...
[16:12:10] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) oh, I somehow missed that we had programmatic access! https://wikitech.wikimedia.org/wiki/Help:Using_OpenStack_APIs
[16:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:20:11] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (Andrew) >>! In T350586#9313151, @jsn.sherman wrote: >>>! In T350586#9313136, @Andrew wrote: >> I deleted wikilink-nfs. >> >> I detached and expanded docker-data-root....
[16:25:50] cloud-services-team: clouddb1019 memory alert - https://phabricator.wikimedia.org/T346826 (Marostegui) clouddb1015 is now alerting on this too.
[16:28:58] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) This is looking good on initial boot. I'm going to let data collection catch up and then give it a reboot to verify that things are happy.
[16:53:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[16:54:19] (HAProxyBackendUnavailable) firing: HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:54:40] (GaleraDown) firing: Galera/MariaDB down on cloudcontrol1005:9104 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraDown - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraDown
[16:54:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[16:58:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[16:59:19] (HAProxyBackendUnavailable) resolved: HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:59:40] (GaleraDown) resolved: Galera/MariaDB down on cloudcontrol1005:9104 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraDown - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraDown
[16:59:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[17:08:56] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Host rebooted by fnegri@cumin1001 with reason: Rebooting to test if everything works after the reimage and pdns setup
[17:11:04] (HAProxyBackendUnavailable) firing: (2) HAProxy service designate-api_backend backend cloudservices1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:16:04] (HAProxyBackendUnavailable) resolved: HAProxy service designate-api_backend backend cloudservices1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:21:56] PROBLEM - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.015 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:56] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.015 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:56] PROBLEM - Check DNS auth via UDP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.014 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:57] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.012 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:57] PROBLEM - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.010 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:57] PROBLEM - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.013 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:57] PROBLEM - Check DNS auth via TCP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.014 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:21:58] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: DNS CRITICAL - 0.015 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:22:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:22:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bookworm completed: - cloudservices...
[17:22:46] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1)
[17:22:52] VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (RhinosF1) In progress→Resolved
[17:23:07] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) We need to look at best way for future and importing new wikis
[17:23:40] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats needs improved data and presentation for fandom - https://phabricator.wikimedia.org/T215534 (RhinosF1)
[17:32:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:47:23] Cloud-VPS, Moderator-Tools-Team (Kanban): cinder volumes stuck in detaching, deleting states - https://phabricator.wikimedia.org/T350586 (jsn.sherman) Open→Resolved looks good on reboot; thanks @Andrew!
[18:49:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:54:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:04:05] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:06:05] (CR) BryanDavis: [C: +2] dev(Makefile): Prefer Docker Compose v2 [labs/striker] - https://gerrit.wikimedia.org/r/970853 (owner: BryanDavis)
[19:09:58] (Merged) jenkins-bot: dev(Makefile): Prefer Docker Compose v2 [labs/striker] - https://gerrit.wikimedia.org/r/970853 (owner: BryanDavis)
[19:12:42] (CR) BryanDavis: Use full url if provided in the suburl field (1 comment) [labs/striker] - https://gerrit.wikimedia.org/r/962144 (https://phabricator.wikimedia.org/T345776) (owner: Sohom Datta)
[19:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:15:58] Striker, Patch-For-Review: Concatenated URLs in toolinfo.json - https://phabricator.wikimedia.org/T345776 (bd808) >>! In https://gerrit.wikimedia.org/r/c/labs/striker/+/962144, @bd808 wrote: >>>! In https://gerrit.wikimedia.org/r/c/labs/striker/+/962144, @soda wrote: >> Btw, a lot of these URLs are from...
[19:16:08] (CR) BryanDavis: [C: +2] dev: Bump GitLab container to v16.3.6 [labs/striker] - https://gerrit.wikimedia.org/r/970854 (owner: BryanDavis) [19:16:31] (CR) BryanDavis: [C: +2] gitlab: Handle error response JSON decode failures gracefully [labs/striker] - https://gerrit.wikimedia.org/r/970855 (owner: BryanDavis) [19:18:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:18:43] (Merged) jenkins-bot: dev: Bump GitLab container to v16.3.6 [labs/striker] - https://gerrit.wikimedia.org/r/970854 (owner: BryanDavis) [19:19:42] (Merged) jenkins-bot: gitlab: Handle error response JSON decode failures gracefully [labs/striker] - https://gerrit.wikimedia.org/r/970855 (owner: BryanDavis) [19:41:35] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.011 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:21] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.013 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:29] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.020 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 
300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:31] RECOVERY - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.012 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:35] RECOVERY - Check DNS auth via TCP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.013 seconds response time (tools-sgegrid-master.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.5.129) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:43:19] RECOVERY - Check DNS auth via UDP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.045 seconds response time (tools-sgegrid-master.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.5.129) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:48:13] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.014 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:11] RECOVERY - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.013 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:01] (CR) Incola: [V: +1 C: +2] SQL query code fix after "revision" table updates starting from MediaWiki 1.35 [labs/tools/lists] - https://gerrit.wikimedia.org/r/972028 (owner: Mess) [20:01:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (Andrew) I updated the docs, but here is the sticky bit about 'master' records in the pdns db on cloudservices nodes: > You will also need to update pdns on all node... [21:25:00] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:14:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:32:39] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_req... [22:37:39] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_req... [22:49:28] (OpenstackAPIResponse) firing: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:18:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:43:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange