[00:28:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [00:35:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [01:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [01:48:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [01:53:33] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [03:03:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [03:08:26] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [03:28:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [03:35:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [04:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [04:48:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [05:53:33] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:05:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance tools-sgeweblight-10-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:28:57] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [06:38:22] 10Cloud-VPS, 10cloud-services-team: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266 (10hashar) [06:38:45] 10Cloud-VPS, 10cloud-services-team: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266 (10hashar) >>! In T326266#9203951, @dancy wrote: > https://gerrit.wikimedia.org/r/960576 broke puppet on `deploy-1004.devtools.eqiad1.wikimedia.cloud`. > > ` > Error while evaluating a Resou... [06:56:31] 10Cloud-VPS, 10cloud-services-team: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266 (10hashar) [07:08:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [07:17:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [07:18:23] 10Tool-bub2: npm install not working - https://phabricator.wikimedia.org/T348833 (10Spykelionel) [07:21:19] 10Tool-bub2: npm install not working - https://phabricator.wikimedia.org/T348833 (10Spykelionel) p:05Triage→03Low [07:45:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgeweblight-10-26 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:48:57] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [07:49:03] (InstanceDown) firing: Project tools instance tools-sgeweblight-10-26 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:54:03] (InstanceDown) resolved: Project tools instance tools-sgeweblight-10-26 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:20:55] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285) [08:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:21:01] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 [08:27:56] (ToolsGridQueueProblem) firing: (2) Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E 
- https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [08:30:50] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285) [08:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:30:57] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 [08:31:18] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285) [08:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:39:49] (TfInfraTestDestroyFailed) firing: (2) Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [08:41:22] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285) [08:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:41:28] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 [08:42:03] (InstanceDown) firing: Project tools instance tools-sgeexec-10-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:44:49] (TfInfraTestDestroyFailed) firing: (2) Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [08:49:04] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 
bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [08:50:22] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10dcaro) p:05Triage→03High [08:51:10] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10dcaro) I can't access the switch to investigate, @cmooney can you give it a look? [08:53:46] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10dcaro) From logstash: {F38220146} [08:59:21] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10cmooney) cloudservices2004-dev restarted: ` cmooney@cloudservices2004-dev:~$ uptime 08:58:22 up 27 min, 1 us... [08:59:49] (TfInfraTestDestroyFailed) firing: (2) Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [09:01:21] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10cmooney) Outage duration: ` Oct 13 08:27:26 cloudsw1-b1-codfw l2cpd[11290]: LLDP_NEIGHBOR_DOWN: A neighbor of... 
[09:02:03] (InstanceDown) resolved: Project tools instance tools-sgeexec-10-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:02:08] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [09:04:49] (TfInfraTestDestroyFailed) firing: (2) Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [09:06:55] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [09:06:58] !log toolsbeta dcaro@urcuchillay END (ERROR) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=97) [09:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [09:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [09:07:11] !log tools dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [09:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [09:07:20] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [09:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [09:08:29] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10dcaro) Thanks @cmooney, yes this was a controlled reboot of one of the cloudservices, will create a task to au... 
[09:09:49] (ToolsGridQueueProblem) resolved: (2) Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [09:14:45] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs][builder] Inject nodejs buildpack - https://phabricator.wikimedia.org/T346635 (10CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/re... [09:16:29] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [wmcs-cookbooks] add a cookbook to reboot a cloudservices host - https://phabricator.wikimedia.org/T348841 (10dcaro) [09:17:41] 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [wmcs-cookbooks] add a cookbook to reboot a cloudservices host - https://phabricator.wikimedia.org/T348841 (10dcaro) [09:18:08] 10cloud-services-team, 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [cloudsw1] BGP alert and port alert flapping - https://phabricator.wikimedia.org/T348839 (10dcaro) 05Open→03Resolved a:03dcaro [09:18:50] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (10fnegri) OpenStack .deb packages have now been upgraded to Antelope (using the cookbooks `upgrade_openstack_node` and `live_upgrade_openstack`) on all c... 
[09:20:43] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (10fnegri) [09:21:08] 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [wmcs-cookbooks] add a cookbook to reboot a cloudservices/cloudlb host - https://phabricator.wikimedia.org/T348841 (10dcaro) [09:22:07] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Unplanned, 10User-dcaro: [wmcs-cookbooks] add a cookbook to reboot a cloudservices/cloudlb host - https://phabricator.wikimedia.org/T348841 (10fnegri) [09:24:16] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (10fnegri) [09:24:20] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri) [09:24:26] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder [09:24:34] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10dcaro) Some aspects I see for each option: == Option 1 == Pros: * We keep using python of which we might have more experience with * A lot of the current code can be reused Cons: *... [09:24:42] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder [09:26:39] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (10fnegri) We now want to test that everything works fine in codfw, before proceeding with upgrading eqiad. I created two sub-tasks for the eqiad work:... 
[09:48:12] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-builder [09:48:29] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-builder [09:51:02] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10Slst2020) Agree with @dcaro's points, and also: == Option 1 == Pros: * Reuse of code would likely speed up the migration * Allows us to do a gradual migration, merging one CLI at a... [09:53:01] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10Slst2020) [09:53:33] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [10:15:29] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs][builder] Inject nodejs buildpack - https://phabricator.wikimedia.org/T346635 (10CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/re... [10:16:29] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) Codfw equivalent subnet that needs changing also: ` cmooney@cloudcontrol2005-dev:~$ sudo wmcs-openstack subnet show 2596edb4-5a40-... 
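Note on T348140 (the /30 to /29 change mentioned at [10:16:29]): the practical difference is the number of usable host addresses per subnet. The sketch below uses only Python's standard `ipaddress` module; the `192.0.2.0` prefix is a documentation range chosen for illustration, not the real cloud-instance-transport subnet, and `usable_hosts` is a hypothetical helper name.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Count usable host addresses (network and broadcast excluded)."""
    net = ipaddress.ip_network(cidr)
    return sum(1 for _ in net.hosts())


# Illustrative prefixes only (192.0.2.0/24 is reserved for documentation).
for cidr in ("192.0.2.0/30", "192.0.2.0/29"):
    net = ipaddress.ip_network(cidr)
    print(cidr, "total:", net.num_addresses, "usable:", usable_hosts(cidr))
# 192.0.2.0/30 total: 4 usable: 2
# 192.0.2.0/29 total: 8 usable: 6
```

A /30 leaves room for two hosts, while a /29 leaves room for six.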
[10:18:13] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.653% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:20:43] 10Toolforge Jobs framework: toolforge-jobs – wikihistory needs a container with both php7 and mono - https://phabricator.wikimedia.org/T305780 (10Slst2020) [10:20:50] 10Cloud Services Proposals, 10Toolforge Build Service, 10cloud-services-team, 10Cloud-Services-Origin-Team, and 3 others: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs - https://phabricator.wikimedia.org/T194332 (10Slst2020) [10:21:22] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs] User story - I can use multiple language stacks for my application - https://phabricator.wikimedia.org/T325799 (10Slst2020) 05In progress→03O... [10:21:46] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs] User story - I can use multiple language stacks for my application - https://phabricator.wikimedia.org/T325799 (10Slst2020) a:05Slst2020→03No... [10:23:26] 10Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (10Slst2020) 05Open→03In progress [10:25:21] 10VPS-project-Codesearch, 10GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (10Ladsgroup) >>! In T268196#9247842, @kostajh wrote: >>>! In T268196#8922105, @hashar wrote: >> If we don't want to rely on GitHub search, then I guess codesearc... 
[10:38:13] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.936% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [10:49:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [11:04:16] 10VPS-project-Codesearch, 10GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (10kostajh) >>! In T268196#9249327, @Ladsgroup wrote: >>>! In T268196#9247842, @kostajh wrote: >>>>! In T268196#8922105, @hashar wrote: >>> If we don't want to re... [11:08:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [11:16:05] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs] User story - I can use multiple language stacks for my application - https://phabricator.wikimedia.org/T325799 (10dcaro) @Slst2020 I think we ca... 
[11:18:39] 10Toolforge Jobs framework: toolforge-jobs – wikihistory needs a container with both php7 and mono - https://phabricator.wikimedia.org/T305780 (10Slst2020) [11:18:50] 10Cloud Services Proposals, 10Toolforge Build Service, 10cloud-services-team, 10Cloud-Services-Origin-Team, and 3 others: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs - https://phabricator.wikimedia.org/T194332 (10Slst2020) [11:19:17] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 3 others: [tbs] User story - I can use multiple language stacks for my application - https://phabricator.wikimedia.org/T325799 (10Slst2020) 05Open→03Resolved... [11:52:08] 10Toolforge Jobs framework: toolforge-jobs – wikihistory needs a container with both php7 and mono - https://phabricator.wikimedia.org/T305780 (10Slst2020) >>! In T305780#9246229, @taavi wrote: > The toolforge build service should make this possible: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Servi... 
[12:04:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [12:44:06] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) p:05Low→03High This is causing some issues, should be fixed sooner than later, bumping priority [12:50:24] 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) [13:15:20] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [buildservice] Create GET /build/latest endpoint in the buildservice API - https://phabricator.wikimedia.org/T345675 (10CodeReviewBot) dcaro m... [13:24:00] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [buildservice] Create GET /build/latest endpoint in the buildservice API - https://phabricator.wikimedia.org/T345675 (10CodeReviewBot) dcaro o... 
[13:26:27] !log toolsbeta dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [13:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:26:58] !log toolsbeta dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:28:42] !log tools dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [13:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:29:14] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [13:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:31:49] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [buildservice] Create GET /build/latest endpoint in the buildservice API - https://phabricator.wikimedia.org/T345675 (10CodeReviewBot) dcaro m... [13:32:45] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [buildservice] Create GET /build/latest endpoint in the buildservice API - https://phabricator.wikimedia.org/T345675 (10dcaro) 05In progress... [13:40:19] 10Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (10Slst2020) @dcaro This is driving me nuts xd. ` - name: inject-buildpacks image: "{{ .Values.imagesSource.bashImage }}" args: [] env: - name: WORKSPACE_OUT... 
[13:40:41] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project: [builds-cli] Use the API to retrieve the latest build - https://phabricator.wikimedia.org/T348866 (10dcaro) [13:40:50] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project: [builds-cli] Use the API to retrieve the latest build - https://phabricator.wikimedia.org/T348866 (10dcaro) 05Open→03In progress [13:40:53] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [buildservice] Create GET /build/latest endpoint in the buildservice API - https://phabricator.wikimedia.org/T345675 (10dcaro) [13:42:42] 10Toolforge (Toolforge iteration 01), 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10Patch-For-Review: [builds-cli] Use the API to retrieve the latest build - https://phabricator.wikimedia.org/T348866 (10CodeReviewBot) dcaro opened https://gitla... [13:49:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:53:35] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:02:33] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (10Jhancock.wm) [14:03:12] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10Jhancock.wm) 05Open→03Resolved it's been 3 days. good enough for me. [14:12:54] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [15:04:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:05:10] 10Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (10dcaro) I think that the exec format error is for the extra newline at the start of the script, the shebang needs to be the first line. I have not been able to reproduce locally (getting... [15:05:41] 10Cloud-VPS: update github action - https://phabricator.wikimedia.org/T348873 (10rook) [15:06:51] 10Toolforge (Toolforge iteration 01): [tbs][builder] Refactor task yaml template - https://phabricator.wikimedia.org/T348750 (10dcaro) :facepalm: I had a typo in the file name... 
this might work: ` script: | {{ .Files.Get "inject_buildpack.sh" | nindent 8}} ` Like, on the same line [15:06:55] 10PAWS: update build-and-push action - https://phabricator.wikimedia.org/T348874 (10rook) [15:08:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:09:13] 10PAWS: update build-and-push action - https://phabricator.wikimedia.org/T348874 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/339 [15:09:16] vivian-rook opened https://github.com/toolforge/paws/pull/339 [15:15:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [15:17:17] 10VPS-project-Codesearch, 10GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (10dancy) Something like `curl https://gitlab.wikimedia.org/api/v4/groups/186/projects?include_subgroups=true` returns the first page of projects under `/repos`.... [15:18:54] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [15:30:54] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (10JJMC89) Many clients use a UTF-8 encoding by default, but that doesn't affect the character set. I should be able to specify the ch... [15:37:20] 10Toolforge (Toolforge iteration 01): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10dcaro) > Allows us to do a gradual migration, merging one CLI at a time That can be done with the option 2 also (the current modular design allows for easy plucking of each cli), a... 
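Note on T348750 (the [15:05:10] and [15:06:51] messages): the "shebang needs to be the first line" point can be illustrated with a rough Python model of the template helpers. This is a sketch under assumptions, not the actual chart. `indent` and `nindent` below emulate the Helm/Sprig functions of the same name (`nindent` prepends a newline before indenting). `literal_block_content` is a hypothetical, simplified stand-in for how a YAML literal block (`|`) is read, and the two layouts are a guess at what the truncated snippet in the task looked like.

```python
import textwrap


def indent(spaces: int, text: str) -> str:
    """Emulate Sprig `indent`: pad the first line and every following line."""
    pad = " " * spaces
    return pad + text.replace("\n", "\n" + pad)


def nindent(spaces: int, text: str) -> str:
    """Emulate Sprig `nindent`: like `indent`, with a newline prepended."""
    return "\n" + indent(spaces, text)


def literal_block_content(after_pipe: str) -> str:
    """Simplified model of a YAML `|` block: the content starts on the line
    after the pipe, and the common indentation is stripped."""
    _, _, body = after_pipe.partition("\n")
    return textwrap.dedent(body)


def shebang_is_first(script: str) -> bool:
    """The kernel only honours `#!` at byte 0 of the file."""
    return script.startswith("#!")


SCRIPT = "#!/bin/bash\necho hello\n"  # stand-in for inject_buildpack.sh

# Layout A (assumed): the helper sits on its own indented line below `script: |`.
own_line = "\n" + " " * 8 + nindent(8, SCRIPT)
# Layout B: the helper follows `script: |` on the same line.
same_line = nindent(8, SCRIPT)

print(shebang_is_first(literal_block_content(own_line)))   # False
print(shebang_is_first(literal_block_content(same_line)))  # True
```

In layout A the newline that `nindent` prepends becomes a blank first line of the script, which pushes the shebang down to line two. In layout B that same newline simply ends the `script: |` line, so the shebang stays first.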
[16:29:50] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [16:33:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:34:59] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [16:35:46] 10cloud-services-team (Hardware), 10DC-Ops, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) 05In progress→03Resolved [16:36:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [16:42:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:43:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [16:49:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:50:36] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (10fnegri) > What I did find out from testing with the revised configuration group is that I cannot create the database when creating... 
[17:03:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:05:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [17:05:53] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Continuous-Integration-Config, and 2 others: [ci,operations-puppet] upgrade to tox 4 in order to detect changed requirement files - https://phabricator.wikimedia.org/T345152 (hashar) There is a fe... [17:17:32] Tool-openstack-browser: openstack-browser: LIst public object storage buckets - https://phabricator.wikimedia.org/T348884 (taavi) [17:20:41] Cloud-VPS, Data-Services, cloud-services-team, User-Marostegui: Horizon Object Storage UI should not display for readers - https://phabricator.wikimedia.org/T348885 (Andrew) [17:23:36] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:33:05] Cloud Services Proposals: Decision Request - Incident response process - https://phabricator.wikimedia.org/T348887 (fnegri) [17:35:09] Cloud Services Proposals: Decision Request - Incident response process - https://phabricator.wikimedia.org/T348887 (fnegri) Please consider this as a draft that we can improve together. Feel free to suggest additional pros/cons, or to say that you don't agree with something I wrote in the description. :) [17:38:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:40:40] Cloud Services Proposals: Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (fnegri) [17:44:49] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance quarry-web-02 in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:48:28] (OpenstackAPIResponse) resolved: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:04:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:46:26] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (mahmoud) [18:53:44] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (mahmoud) Open→In progress Actually, I just did a closer read of the docs, and Montage isn't using all of its allocated pods. I just bumped it up myself and will monitor performa... [18:54:39] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (taavi) Hi. I think you can add more replicas with the `--replicas` webservice flag, the default quota is 10 pods so you should be able to fit some extra replicas comfortably there. Righ... 
[19:00:09] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (mahmoud) Yeah, the Python 3 backend is already deployed over on the montage-dev tool. Peak traffic isn't an ideal time to migrate. And yeah, I tried running more replicas, but I guess... [19:19:37] (CephClusterInWarning) firing: Ceph cluster in is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:24:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:24:37] (CephClusterInWarning) resolved: Ceph cluster in is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:26:04] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) [19:28:37] (CephClusterInWarning) firing: Ceph cluster in is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:40:23] VPS-project-Codesearch, GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (Ladsgroup) >>! In T268196#9249422, @kostajh wrote: > I think we would want all of them? In which case, we could use https://docs.gitlab.com/ee/api/repositories... 
[19:49:28] VPS-project-Codesearch, GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196 (brennen) > Are we sure we want to scan every gitlab repo? Stuff like toolforge repos, debian repos, upstream repos we have vendored are a lot. I don't have a... [19:49:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [20:49:49] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance quarry-web-02 in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:04:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:24:22] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:28:51] (PS1) Gergő Tisza: Add missing library repos from https://doc.wikimedia.org/#libraries [labs/codesearch] - https://gerrit.wikimedia.org/r/965809 [22:49:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [23:10:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [23:12:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [23:28:21] (PS1) Gergő Tisza: Add more missing library repos from Iae28fa6b31 [labs/codesearch] - https://gerrit.wikimedia.org/r/965841 [23:28:37] (CephClusterInWarning) firing: Ceph cluster in is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [23:29:10] (CR) CI reject: [V: -1] Add more missing library repos from Iae28fa6b31 [labs/codesearch] - https://gerrit.wikimedia.org/r/965841 (owner: Gergő Tisza) [23:31:54] (PS2) Gergő Tisza: Add more missing library repos from Iae28fa6b31 [labs/codesearch] - https://gerrit.wikimedia.org/r/965841 [23:37:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (JJMC89) > As a workaround, you should be able to use [ALTER DATABASE](https://mariadb.com/kb/en/alter-database/) after the database... 
[23:49:49] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance quarry-web-02 in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun