[00:08:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:18:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:23] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (LucasWerkmeister) Well, it hasn’t turned itself group-readonly so far.
[00:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:54:30] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:16:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[03:34:16] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (SD0001) Would be good to consolidate discussion in {T178520} - maybe we could switch directly to object storage.
[03:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[05:16:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[05:34:12] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook) >>! In T349690#9282576, @SD0001 wrote: > Would be good to consolidate discussion in {T178520} - maybe we could switch directly to object storage. Oh look at that, been a good idea for some time.
[05:58:42] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:17:17] PROBLEM - Disk space on cloudbackup2001 is CRITICAL: DISK CRITICAL - free space: /srv/cinder-backups 3146014 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2001&var-datasource=codfw+prometheus/ops
[06:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:08:42] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:18:42] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:36:15] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi)
[07:36:29] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi) p:Triage→Medium
[08:05:55] Toolforge: I can't upload a file into my Toolforge folder - https://phabricator.wikimedia.org/T349797 (MBH)
[08:08:58] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/wmcs-k8s-metrics/-/merge_req...
[08:09:46] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (CodeReviewBot) taavi updated https://gitlab.wikimedia.org/repos/cloud/toolforge/wmcs-k8s-metrics/-/merge_requests/4 chart: update cadvisor to 0.47.2
[08:16:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:20:20] Toolforge: I can't upload a file into my Toolforge folder - https://phabricator.wikimedia.org/T349797 (taavi) Can you re-connect and try now?
[08:21:38] Toolforge: I can't upload a file into my Toolforge folder - https://phabricator.wikimedia.org/T349797 (MBH) It works now. What was the reason of the problem?
[08:25:45] Toolforge: I can't upload a file into my Toolforge folder - https://phabricator.wikimedia.org/T349797 (taavi) Open→Resolved a:taavi The service responsible for checking which groups an account is in (`sssd-nss.service`) had got stuck for whatever reason on the file server. I simply restarted it.
[08:27:27] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (taavi) I noticed that `sssd-nss.service` had crashed on the file system server, that might have been causing this. Can you try now on a file that is group-writable?
[08:36:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:36:10] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Open→Resolved a:Magnus @taavi Success! I tried `/Users/mm6/php/magnustools/public_html/php/ToolforgeCommon.php` which had magically reverted to non-group-writable again, so I changed it t...
[08:45:31] (PS1) Jelto: gitlab_runner: add token for new authentication scheme [labs/private] - https://gerrit.wikimedia.org/r/968996 (https://phabricator.wikimedia.org/T344951)
[08:48:19] (CR) Jelto: [V: +2 C: +2] gitlab_runner: add token for new authentication scheme [labs/private] - https://gerrit.wikimedia.org/r/968996 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[09:07:52] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/wmcs-k8s-metrics/-/merge_requests/4 chart: update cadvisor to 0.47.2
[09:09:33] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[09:10:09] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[09:15:17] Cloud-VPS, cloud-services-team, User-fgiunchedi: Linting problems found for OpenstackAPIResponse - https://phabricator.wikimedia.org/T349801 (fgiunchedi)
[09:16:33] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[09:16:48] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[09:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:46:19] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node
[09:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:20:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:31:37] (CephSlowOps) firing: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:35:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:36:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 15 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:44:37] (CephSlowOps) firing: Ceph cluster in eqiad has 695 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:44:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[10:49:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 695 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:18:42] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:33:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:11:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:16:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:48:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance syslog-server-audit02 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:49:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-docker-registry-05 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:56:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:03:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-docker-registry-02 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:04:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:13:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:14:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-docker-registry-05 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:22:57] Toolforge-standards-committee: Adoption request for geograph2commons - https://phabricator.wikimedia.org/T345707 (bjh21)
[13:48:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance toolsbeta-docker-registry-02 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:58:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[13:59:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-docker-registry-05 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:01:57] !log admin dcaro@urcuchillay END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[14:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:04:03] (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance tools-docker-registry-05 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:08:03] (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:09:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:14:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:19:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:05:37] Tools: Move development of `tool-versions.git` from Differential to Gerrit - https://phabricator.wikimedia.org/T252910 (Aklapper) For the records, the codebase has been moved to GitLab in the meantime: https://gitlab.wikimedia.org/toolforge-repos/versions
[15:13:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:27:32] Toolforge (Tools to be deleted), Projects-Cleanup, User-bd808: Delete tool recoin-sample - https://phabricator.wikimedia.org/T181541 (Aklapper) For transparency: There is still a dangling empty repository at https://phabricator.wikimedia.org/diffusion/2178/ and I took the liberty to "break in" per ht...
[15:31:01] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) New locations are as follows cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20...
[15:33:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:34:30] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:09:34] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cumin and cloud-vps instances not working - https://phabricator.wikimedia.org/T347428 (fnegri)
[16:09:45] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453 (fnegri) Open→In progress
[16:17:47] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (fnegri) In progress→Resolved I have increased the limits, @mahmoud let us know if that helps. ` (venv)tools.montage@tools-sgebastion-10:~$ kubectl describe resourcequotas Name:...
[16:39:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[16:39:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (fnegri)
[16:40:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[16:40:48] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (fnegri)
[16:40:57] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Goal, Patch-For-Review: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri)
[16:52:09] (PS1) FNegri: upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172
[16:58:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[17:05:17] Tools, WMDE-TechWish-Maintenance: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (Aklapper)
[17:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:07:20] (CR) FNegri: "I tested it with a dry run:" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172 (owner: FNegri)
[17:13:00] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) The SAL messages above generated by the cookbook `wmcs.openstack.cloudnet.reboot_node` were tagged incorrectly with this Phab task, they should have been tagge...
[17:20:31] Cloud-VPS, cloud-services-team: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (cmooney) >>! In T349735#9281320, @Andrew wrote: > As far as I know there's no reason at all that these have 1G connections other than history and laziness. The only reason it would matter is if they l...
[17:22:34] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) >>! In T346948#9284698, @VRiley-WMF wrote: > New locations are as follows > > cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 > > c...
[17:31:15] (CR) Sohom Datta: "I agree on second thoughts this is definitely not the way to go. Will make a new patch wrt to adding a filter." [labs/striker] - https://gerrit.wikimedia.org/r/962144 (https://phabricator.wikimedia.org/T345776) (owner: Sohom Datta)
[17:31:43] (Abandoned) Sohom Datta: Use full url if provided in the suburl field [labs/striker] - https://gerrit.wikimedia.org/r/962144 (https://phabricator.wikimedia.org/T345776) (owner: Sohom Datta)
[17:41:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:45:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:52:14] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) >>! In T346948#9284698, @VRiley-WMF wrote: > cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058 Thanks! I'm getting a duplicate cable ID ale...
[18:13:35] cloud-services-team, MediaWiki-Engineering: Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (nskaggs) This is still something the NetOps folks would like to see happen. This ticket specifically is a place to list issues or blockers with...
[18:44:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:50:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:53:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:30:55] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm
[19:38:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:42:08] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm executed with...
[20:00:01] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm
[20:13:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:20:35] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (mahmoud) Great, thank you for the resources and tips! Will circle back if there are more reports of issues. Thanks again!
[20:37:00] Cloud-VPS, cloud-services-team: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (Andrew) Refresh schedule: 2025: cloudcontrol1005 2024: cloudvirt-wdqs1001 2024: cloudvirt-wdqs1002 2024: cloudvirt-wdqs1003
[20:41:02] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:41:14] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:41:41] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:42:05] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:44:48] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:45:03] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:45:50] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm completed: - c...
[20:54:50] Tool-ducttape, Abstract Wikipedia team: Provide mechanism for getting test artefacts out of pipeline - https://phabricator.wikimedia.org/T334228 (SDunlap) Stalled→Open
[20:55:02] Tool-ducttape, Abstract Wikipedia team, Wikifunctions: New function orchestrator patches are tested with DUCT. - https://phabricator.wikimedia.org/T333191 (SDunlap) In progress→Open a:SDunlap→None
[20:55:31] Tool-ducttape, Abstract Wikipedia team: Explore using Openstack Magnum - https://phabricator.wikimedia.org/T333381 (SDunlap) Stalled→Resolved
[20:55:39] !log cloudvirt-canary taavi@runko START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[20:55:52] Tool-ducttape, Abstract Wikipedia team: Add Documentation to Wikitech or Mediawiki about DUCT - https://phabricator.wikimedia.org/T331756 (SDunlap) Open→Resolved
[21:11:15] !log cloudvirt-canary taavi@runko START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:11:45] !log cloudvirt-canary taavi@runko END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[21:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:12:38] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:17:01] !log cloudvirt-canary taavi@runko START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:17:23] !log cloudvirt-canary taavi@runko END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[21:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:20:26] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah still trying to figure out why Nova is not scheduling anything there https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:22:25] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:22:42] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[21:27:25] !log cloudvirt-canary taavi@runko START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:27:41] !log cloudvirt-canary taavi@runko END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[21:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudvirt-canary/SAL
[21:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[21:49:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:55:10] Grid-Engine-to-K8s-Migration: Migrate ytcleaner from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320199 (Mbch331) Looks like it's fixed. But need to check my cronjobs still work, now they've been migrated to k8s
[22:49:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:06:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:16:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:38:42] (OpenstackAPIResponse) firing: (9) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse