[00:35:35] Grid-Engine-to-K8s-Migration: Migrate deltaquad-bots from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319668 (AmandaNP) Open→Resolved Grid engine tasks have been disabled.
[00:46:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:48:01] Grid-Engine-to-K8s-Migration: Migrate fastilybot-reports from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319741 (Fastily) @nskaggs Do you know if the cron (i.e. `crontab`) functionality of toolforge tool accounts is going away with the deprecation of Grid Engine? Thanks!
[02:45:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[02:54:57] Tools, Chinese-Sites: zhdeletionpedia tool violates the Toolforge database connection handling policy - https://phabricator.wikimedia.org/T353556 (Shizhao)
[02:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:34:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:34:59] Tools: Make the TemplateCount tool work on all wikis - https://phabricator.wikimedia.org/T353607 (Frostly) See also https://linkcount.toolforge.org/, which has this functionality.
[03:46:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:50:22] Grid-Engine-to-K8s-Migration: Migrate enboten from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319721 (komla) @Lejonel don't hesitate to share any specific issues you might be having with the migration process here. The team will assist
[03:50:24] Grid-Engine-to-K8s-Migration: Migrate enboten from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319721 (komla) @Lejonel don't hesitate to share any specific issues you might be having with the migration process here. The team will assist
[03:50:36] Grid-Engine-to-K8s-Migration: Migrate enboten from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319721 (komla) @Lejonel don't hesitate to share any specific issues you might be having with the migration process here. The team will assist
[03:57:37] Grid-Engine-to-K8s-Migration: Migrate fastilybot-reports from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319741 (komla) @Fastily cron jobs functionality is still available with similar syntax. See: [[ https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Cre...
[05:28:31] Grid-Engine-to-K8s-Migration: Migrate fastilybot-reports from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319741 (komla) @Fastily You might also find the use [[ https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#Use_case_continuity | case continui...
[05:42:57] Grid-Engine-to-K8s-Migration: Migrate fastilybot-reports from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319741 (Fastily) @komla I'm aware of those, but does this mean that running `crontab` from the command line in tool accounts is going away? I have [[ https://github...
[05:45:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[05:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:46:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:34:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:01:18] Cloud-VPS, cloud-services-team, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (ayounsi) Nice ! So next step here is to decom the current ones and then sync up with DCops to move them to the proper racks. From there we can re-pr...
[08:45:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[08:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:46:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:10:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:48:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:52:11] cloud-services-team, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2023/2024-Q2): Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457 (fgiunchedi) Thank you for looking into this @fnegri ! I've looked and the alertmanager configuration and the u...
[10:53:21] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:25:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[11:34:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:45:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[11:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:46:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:50:36] Cloud Services Proposals, Toolforge Build Service (Iteration 11): Decision request - What buildpacks to allow and include for toolforge build service beta - https://phabricator.wikimedia.org/T330102 (taavi) Yes, but that seems to come with the cost of increased complexity and risk (having to update the v...
[13:58:30] Toolforge, cloud-services-team: Relocate disable-tool-archive-dbs.service - https://phabricator.wikimedia.org/T353642 (taavi)
[14:20:41] Cloud-Services: Install rabbitmq-server security updates on cloudrabbit* hosts - https://phabricator.wikimedia.org/T353646 (MoritzMuehlenhoff) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with...
[14:21:48] Cloud-VPS, cloud-services-team: Install rabbitmq-server security updates on cloudrabbit* hosts - https://phabricator.wikimedia.org/T353646 (taavi)
[14:28:43] Grid-Engine-to-K8s-Migration: Migrate smallem from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320048 (Klein) I'm the maintainer of this tool. Currently I'm trying to migrate my jobs into Kubernetes but I'm encountering technical difficulties. I'm trying to solve them wi...
[14:29:35] Toolforge (Quota-requests): Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104 (fnegri) +1
[14:38:46] Toolforge, cloud-services-team: Relocate disable-tool-archive-dbs.service - https://phabricator.wikimedia.org/T353642 (Andrew) a:Andrew
[14:39:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[14:40:07] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[14:40:42] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/159 maintain-k...
[14:40:52] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/159 maintain-k...
[14:42:22] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104 (taavi) a:taavi I've deployed most of those, except: > mgp-cewbot What is moegirlpedia (from the [[ https://toolsadmin.wikim...
[14:42:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[14:43:01] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[14:45:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[14:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:01:38] (CephSlowOps) firing: Ceph cluster in eqiad has 14 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[15:01:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[15:05:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:06:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[15:07:56] (ToolsToolsDBReplicationMissing) firing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[15:07:56] (ToolsToolsDBReplicationError) firing: ToolsDB replication is broken on tools-db-2 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[15:10:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:22:56] (ToolsToolsDBReplicationError) resolved: ToolsDB replication is broken on tools-db-2 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[15:22:56] (ToolsToolsDBReplicationMissing) resolved: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[15:25:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[15:26:48] Cloud-VPS, cloud-services-team: Install rabbitmq-server security updates on cloudrabbit* hosts - https://phabricator.wikimedia.org/T353646 (Andrew) I've upgraded codfw1dev. In-place package upgrade locked up so in eqiad1 I'll need to stop, remove, and install the package.
[15:29:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[15:30:11] Grid-Engine-to-K8s-Migration: Migrate fastilybot-reports from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319741 (taavi) >>! In T319741#9412019, @Fastily wrote: > @komla I'm aware of those, but does this mean that using/running `crontab` from the command line in tool acc...
[15:31:38] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093 (fnegri)
[15:31:40] Data-Services: [toolsdb] Upgrade to MariaDB 10.6 - https://phabricator.wikimedia.org/T352206 (fnegri)
[15:31:54] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093 (fnegri) In progress→Resolved
[15:31:57] Toolforge, Patch-For-Review: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/6 kubernetes: capacity: ignore finished pods
[15:32:00] Toolforge, Patch-For-Review: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/6 kubernetes: capacity: ignore finished pods
[15:34:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:46:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:50:36] Toolforge (Toolforge iteration 02), Patch-For-Review: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581 (taavi) Open→Resolved
[17:19:37] PROBLEM - Check systemd state on cloudrabbit1003 is CRITICAL: CRITICAL - degraded: The following units failed: rabbitmq_detect_partition.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:33] (SystemdUnitDown) firing: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:23:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[17:24:08] RECOVERY - Check systemd state on cloudrabbit1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:33] (SystemdUnitDown) resolved: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:29:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[17:50:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[17:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:13:50] Grid-Engine-to-K8s-Migration: Migrate canary from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319614 (komla) Deleted tool
[18:14:02] Grid-Engine-to-K8s-Migration: Migrate canary from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319614 (komla) Open→Resolved
[18:28:06] Cloud-VPS, cloud-services-team: Install rabbitmq-server security updates on cloudrabbit* hosts - https://phabricator.wikimedia.org/T353646 (Andrew) Open→Resolved
[18:46:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:49:18] Tool-inteGraality: Integraality fails to load the page linking to SPARQL queries when it's requesting a qualifier, not a value - https://phabricator.wikimedia.org/T353667 (Harmonia_Amanda)
[19:31:59] Cloud-VPS (Quota-requests), Tool-spacemedia: disk quota increase (+200 GB) for spacemedia Cloud VPS project - https://phabricator.wikimedia.org/T353670 (Don-vip)
[19:34:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:51:18] Cloud-VPS (Quota-requests), GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (dancy)
[19:52:09] Cloud-VPS (Quota-requests), GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (dancy) @Jelto @dduvall FYI
[20:50:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[20:54:37] (CephSlowOps) firing: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[20:54:41] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[20:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[20:59:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[21:11:04] Cloud-VPS (Quota-requests), cloud-services-team, GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (bd808) p:Triage→Medium +1
[21:15:16] Cloud-VPS (Quota-requests), Tool-spacemedia, cloud-services-team: disk quota increase (+200 GB) for spacemedia Cloud VPS project - https://phabricator.wikimedia.org/T353670 (bd808) p:Triage→Medium +1
[21:37:14] Cloud-Services, WM-Bot: !help : update wmcs-kanban towards cloud-services-team - https://phabricator.wikimedia.org/T353580 (bd808) Thank you for spotting this issue and making us aware @LD. I fear that most of us who patrol the `#wikimedia-cloud` irc channel are "banner blind" to the message that `!help`...
[21:44:53] Cloud-VPS (Quota-requests), cloud-services-team, GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (Andrew) a:Andrew
[21:51:04] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:51:58] !log andrew@cloudcumin1001 devtools START - Cookbook wmcs.openstack.quota_increase (T353671)
[21:52:01] !log andrew@cloudcumin1001 devtools END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99) (T353671)
[21:52:03] T353671: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671
[21:53:03] !log andrew@cloudcumin1001 devtools START - Cookbook wmcs.openstack.quota_increase
[21:53:06] !log andrew@cloudcumin1001 devtools END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99)
[21:55:21] Cloud-VPS (Quota-requests), cloud-services-team, GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (Andrew) Open→Resolved I couldn't make the cookbook cooperate but I adjusted the quotas via the openstack cli.
[22:17:39] Cloud-VPS (Quota-requests), cloud-services-team, GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (Andrew) I also boosted your instance count by 2 in case you need space to shuffle things. Hopefully in-place resiz...
[22:18:31] Cloud-VPS (Quota-requests), cloud-services-team, GitLab-Test, Release-Engineering-Team: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671 (dancy) Thanks a lot! I just tried an in-place resize for zuul-1001 and it worked great!
[23:34:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:50:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[23:55:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed