[00:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:09:17] Toolforge Build Service (Beta release), cloud-services-team (FY2023/2024-Q1), Goal: Toolforge Build Service Beta Rollout To Selected Users - https://phabricator.wikimedia.org/T335249 (komla) Shared the announcement draft with cloud admins [[ https://etherpad.wikimedia.org/p/build-service-open-beta |...
[00:11:43] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:23:34] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:24:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:57:43] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:59:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[02:05:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[03:28:53] (CephClusterInWarning) firing: Ceph cluster is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[03:32:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[03:32:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[04:04:13] Toolforge: Add access control for Toolforge Elasticsearch - https://phabricator.wikimedia.org/T348943 (SD0001)
[04:09:05] Toolforge, Discovery-Search, Elasticsearch: Add access control for Toolforge Elasticsearch - https://phabricator.wikimedia.org/T348943 (SD0001)
[04:11:43] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:23:34] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[05:05:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[07:28:53] (CephClusterInWarning) firing: Ceph cluster is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[07:38:57] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) p:Triage→High
[07:39:01] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Open→In progress
[07:44:35] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Tried checking the size of the mon dat...
[07:44:37] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) I'll try restarting the mons, that sho...
[07:45:31] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) That seemed to help a bit, but still n...
[07:50:02] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Hmm, after restarting all the mons thi...
[07:51:08] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Oh, interesting, it freed a lot of spa...
[07:52:38] Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (Slst2020) >>! In T348750#9250192, @dcaro wrote: > this might work: > ` > script: | {{ .Files.Get "inject_buildpack.sh" | nindent 8}} > ` > > Like, on the same line It does work,...
[07:54:47] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Ok, so it seems compaction is continui...
[07:55:59] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Cluster is back healthy, and compactio...
[07:58:38] (CephClusterInWarning) resolved: Ceph cluster is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[08:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:05:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:10:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:11:43] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:23:34] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:43:34] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:03:12] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Yep, back to normal on all mons: ` roo...
[09:18:55] Toolforge (Toolforge iteration 01), Patch-For-Review: [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/17 dev: Refactor shell scripts
[09:20:17] Cloud-VPS, cloud-services-team, Sustainability (Incident Followup): High availability for the main cloud vps web proxy - https://phabricator.wikimedia.org/T316982 (taavi) `lang=shell-session taavi@cloudcontrol1007 ~ $ os floating ip set --port f596fb10-6294-4cb5-b3c0-a1a61a8a2b24 ddd2114e-fa18-435f-a...
[09:23:30] Toolforge (Toolforge iteration 01): decide on which kubernetes bootstrapper to focus on between minikube and kind - https://phabricator.wikimedia.org/T347723 (Slst2020) +1 for switching to kind
[09:25:04] (PS1) David Caro: mypy: skip build directory [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966132
[09:25:11] (PS1) David Caro: alerts: don't fail if host already downtimed or uptimed [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966133
[09:25:15] (PS1) David Caro: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427)
[09:25:19] (PS1) David Caro: ceph: Adapt to multi-level crush tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966135 (https://phabricator.wikimedia.org/T331145)
[09:25:23] (PS1) David Caro: ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709)
[09:27:55] (Abandoned) David Caro: ceph: adapt to rack level HA in the tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/894708 (https://phabricator.wikimedia.org/T331145) (owner: David Caro)
[09:27:58] (Abandoned) David Caro: ceph: add drain and undrain node cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/960081 (owner: David Caro)
[09:28:32] (CR) CI reject: [V: -1] alerts: don't fail if host already downtimed or uptimed [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966133 (owner: David Caro)
[09:28:34] (CR) CI reject: [V: -1] ceph: Adapt to multi-level crush tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966135 (https://phabricator.wikimedia.org/T331145) (owner: David Caro)
[09:28:40] Toolforge (Toolforge iteration 01): find an alternative to Vagrant - https://phabricator.wikimedia.org/T348960 (Slst2020)
[09:28:41] (CR) CI reject: [V: -1] mypy: skip build directory [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966132 (owner: David Caro)
[09:28:46] (CR) CI reject: [V: -1] ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709) (owner: David Caro)
[09:28:57] (CR) CI reject: [V: -1] openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[09:29:02] Cloud-VPS, cloud-services-team (Kanban): cloud vps web proxy is down - https://phabricator.wikimedia.org/T316975 (taavi)
[09:29:04] Cloud-VPS, cloud-services-team, Sustainability (Incident Followup): High availability for the main cloud vps web proxy - https://phabricator.wikimedia.org/T316982 (taavi) Open→Resolved
[09:32:38] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) Everything is stable again, I think th...
[09:32:43] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space - https://phabricator.wikimedia.org/T348951 (dcaro) In progress→Resolved
[09:40:47] Toolforge (Toolforge iteration 01): [tbs][builder] Add shellcheck to pre-commit - https://phabricator.wikimedia.org/T348961 (Slst2020)
[09:49:43] (CR) FNegri: openstack: don't pass the new project when creating it (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[09:56:54] Toolforge (Toolforge iteration 01): [tbs][builder] Add shellcheck to pre-commit - https://phabricator.wikimedia.org/T348961 (Slst2020) Open→In progress a:Slst2020
[10:03:36] (CR) David Caro: openstack: don't pass the new project when creating it (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[10:20:38] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (fnegri) p:Triage→Low
[10:29:01] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account Adamham - https://phabricator.wikimedia.org/T348663 (taavi) Without access to your SSH keys verifying this request is a bit trickier. Since it seems like you've interacted with WMDE folks...
[10:32:26] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account Adamham - https://phabricator.wikimedia.org/T348663 (Botoxparty) @conny-kawohl_WMDE I will send an email to ca@wikimedia.org and CC you
[10:35:22] Toolforge, Toolforge-standards-committee: Adoption request for Item Quality Evaluator - https://phabricator.wikimedia.org/T348968 (karapayneWMDE)
[10:37:47] Toolforge, Toolforge-standards-committee: Adoption request for Item Quality Evaluator - https://phabricator.wikimedia.org/T348968 (dcaro) Open→Resolved a:dcaro Manually added @karapayneWMDE as maintainer after double-checking the identity and situation.
[10:46:14] Cloud-VPS, Documentation: Restructure and improve content for: https://wikitech.wikimedia.org/wiki/Help:Sudo_Policies - https://phabricator.wikimedia.org/T233669 (taavi)
[10:47:48] Toolforge, Elasticsearch: Add access control for Toolforge Elasticsearch - https://phabricator.wikimedia.org/T348943 (taavi)
[11:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[11:41:03] Toolforge Jobs framework, cloud-services-team, Pywikibot, Patch-For-Review, User-Raymond_Ndibe: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787 (taavi) This topic was discussed in the latest Toolforge admin meeting....
[12:11:43] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:28:19] cloud-services-team (FY2023/2024-Q1), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) > Do you expect the Bullseye package to work in Bookworm without the patches you mentioned? Yes.
[12:37:00] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (Slst2020) a:Slst2020
[12:37:39] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (Slst2020) Open→In progress
[12:38:24] Toolforge (Toolforge iteration 01): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020) a:Slst2020→None
[12:43:34] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:58:34] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:59:30] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) > That list is maintained upstream, we can try sending a patch to remove the invalid options. I created a patch upstream:...
[13:28:40] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (Slst2020) @dcaro, when you upgraded Harbor last time, did you go through all these tests? https://goharbor.io/docs/2.9.0/administration/upgrade/upgrade-test/
[13:39:24] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:50:13] Toolforge (Toolforge iteration 01): Upgrade harbor - https://phabricator.wikimedia.org/T346241 (dcaro) We did not, we did try though adding a user, creating a project, making sure that the existing ones still work, and doing a full tool build + deploy. There's also a few features being tested there that we d...
[13:58:51] Toolforge-standards-committee: Adoption request for Item Quality Evaluator - https://phabricator.wikimedia.org/T348968 (JJMC89)
[14:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[14:08:51] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro) Open→In progress
[14:08:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro) a:fnegri→dcaro
[14:08:56] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) I created a [bug report upstream](https://storyboard.openstack.org/#!/story/2010942) suggesting Trove should respect the va...
[14:08:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro)
[14:10:55] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro) There's several layers here, I'll try to fix locally first, and in CI later. Locally it seems to fail with: ` Collecting pyyaml (from wmcs-cookbooks==0.1.d...
[14:22:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro) Using pyyaml==5.3.1 seemed to work past that error. Now the next is with elasticsearch-curator, that is pulled by wikimedia-spicerack: ` Collecting elasti...
[14:34:26] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro) This is related to {T345337} also
[14:34:30] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), User-dcaro: [wmcs-cookbooks] tox is failing - https://phabricator.wikimedia.org/T348726 (dcaro)
[14:54:46] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) In progress→Resolved I updated to [[ https://wikitech.wikimedia.org/wiki/Help:Trove_database_user_guide | Trove hel...
[15:00:20] Cloud-VPS (Quota-requests), linkwatcher: Quota increase for linkwatcher - https://phabricator.wikimedia.org/T348441 (taavi) +1
[15:02:56] Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/1
[15:02:56] vivian-rook closed https://github.com/toolforge/quarry/pull/1
[15:37:42] cloud-services-team: Need baremetal system(s) with internet access - https://phabricator.wikimedia.org/T349003 (rook)
[15:37:55] cloud-services-team: bare metal deploy poc - https://phabricator.wikimedia.org/T348461 (rook) Open→Stalled
[15:37:57] cloud-services-team: [research] kolla-ansible poc - https://phabricator.wikimedia.org/T348457 (rook)
[15:41:13] Cloud-VPS, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (dcaro) a:dcaro
[15:42:47] Cloud-VPS, Infrastructure-Foundations, SRE, netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (dcaro)
[16:11:43] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:14:33] Cloud-VPS (Quota-requests), Infrastructure-Foundations, Puppet CI: Request Additional resources for puppet-diffs project - https://phabricator.wikimedia.org/T349006 (jbond)
[16:15:31] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) @Jclark-ctr do you need the logs as specified here (https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshoo...
[16:26:31] Toolforge, Toolforge-standards-committee, cloud-services-team, Security-Team, and 2 others: Standard process for dealing with public OAuth consumer secrets - https://phabricator.wikimedia.org/T348752 (sbassett)
[16:28:43] Tool-bub2, Internet-Archive, Outreach-Programs-Projects, Outreachy (Round 27): For PDL, download and stream the PDF if available - https://phabricator.wikimedia.org/T348188 (Okerekechinweotito) I have made a PR that fixes this issue PR here - https://github.com/coderwassananmol/BUB2/pull/224
[16:41:17] Cloud Services Proposals: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (dcaro) I think that for both option 1 and 2, the current SRE incident guidelines have too many things we might not need (ex. incident coordinator, updating wikimediastatus.net, calling an sre di...
[17:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[17:43:35] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:58:33] cloud-services-team (Hardware), SRE, ops-codfw, User-dcaro: cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (Papaul) @nskaggs hello true that codfw will be moving to the EVPN/VXLAN design but codfw doesn't have that many racks to dedicate 2 r...
[18:43:01] Toolforge, Toolforge-standards-committee, cloud-services-team, Security-Team, and 2 others: Standard process for dealing with public OAuth consumer secrets - https://phabricator.wikimedia.org/T348752 (LucasWerkmeister) Sounds good to me. The document / guideline in question could also suggest som...
[19:05:14] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2001 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[20:05:20] Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (rook)
[20:06:25] Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (rook)
[20:06:38] Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (rook)
[20:06:48] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (rook)
[20:06:57] Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (rook)
[20:07:01] Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (rook)
[20:08:37] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook)
[20:08:52] Quarry, Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (rook)
[20:11:44] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[20:35:33] cloud-services-team (FY2023/2024-Q1), wmcs-retrospective: Realign and agree on the team social norms. - https://phabricator.wikimedia.org/T327725 (nskaggs) Open→Resolved I think this ticket can now be closed. The conversation around the social norms can and should continue, with more edits and re...
[21:16:36] Cloud-VPS, cloud-services-team, Observability-Metrics, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (cmooney) >>! In T336854#9235377, @fgiunchedi wrote: > pdns auth can't be scraped of cour...
[21:43:35] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:46:29] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:51:29] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[22:27:06] Tool-Pageviews, Data-Engineering: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (MusikAnimal)
[22:27:23] Tool-Pageviews, Data-Engineering: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (MusikAnimal) p:Medium→Unbreak! Raising to UBN as per duplicate task
[22:28:11] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (MusikAnimal)
[22:30:03] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (MusikAnimal)
[22:31:42] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (MusikAnimal) a:Sfaci Sorry for all the noise! I didn't realize until after merging the old task was assigned etc.
[23:04:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[23:11:29] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:13:35] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:48:36] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse