[00:07:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:17:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:37:04] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:49:30] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:58:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:59:29] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:03:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:38:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:48:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:53:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:59:30] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:03:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:04:29] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:18:28] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:18:40] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:24:29] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:37:04] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:47:12] Cloud-VPS, Data-Services, cloud-services-team, Patch-For-Review, User-Marostegui: Support Trove + Swift integration - https://phabricator.wikimedia.org/T349651 (Andrew) Backups seem to work properly with mysql databases. Mariadb backups appear to succeed but are reported as failed -- that loo...
[04:58:41] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:02:48] tool-wscontest: Use calendars to publicize contests - https://phabricator.wikimedia.org/T349678 (PMenon-WMF)
[06:37:04] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[06:38:44] Tools: extreg-wos: Reflection is listed, but repo isn't available on Gerrit - https://phabricator.wikimedia.org/T295522 (Kizule) Also, some extensions aren't listed there at all. Like AddThis, which will be eventually archived, but I'm wondering why it isn't listed right now as well.
[06:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:31:30] Toolforge (Toolforge iteration 01), cloud-services-team: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (dcaro) Open→Resolved
[08:23:41] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:32:04] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:33:41] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:13:19] Toolforge (Toolforge iteration 01), Patch-For-Review: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241 (Slst2020)
[09:36:22] Toolforge (Toolforge iteration 01), Patch-For-Review: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241 (Slst2020)
[09:36:49] Toolforge (Toolforge iteration 01), Patch-For-Review: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241 (Slst2020) In progress→Resolved
[09:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:49:31] Toolforge (Toolforge iteration 01): [tbs][harbor] Improve Harbor admin docs - https://phabricator.wikimedia.org/T349313 (Slst2020) Open→Resolved Docs can always be improved but I think we can declare this as {{done}} for now. Thank you @dcaro for adding information about the deployment layout and adm...
[09:50:04] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus)
[09:50:24] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus)
[09:52:03] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) FWIW, I can edit the files if I `become` the tool, but I don't like editing in shell, especially for larger code.
[09:52:17] Toolforge: Standardize Toolforge CLI user interface looks - https://phabricator.wikimedia.org/T348442 (Slst2020)
[09:55:47] Toolforge: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668 (Slst2020) Is this related to possibly moving Harbor to a Helm deployment?
[09:58:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[10:02:03] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook)
[10:02:28] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook)
[10:02:32] Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (rook)
[10:08:19] Toolforge: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668 (dcaro) Not really, it's simpler than that, it's about having some availability with the current setup. About moving to helm deployment, we would have to consider also if using the same k8s cluster than toolfor...
[10:10:38] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the toolsbeta cluster
[10:14:04] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] Unable disk failure prediciton - https://phabricator.wikimedia.org/T349694 (dcaro) p:Triage→High
[10:14:26] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] Unable disk failure prediciton - https://phabricator.wikimedia.org/T349694 (dcaro)
[10:14:28] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716 (dcaro)
[10:15:15] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716 (dcaro) So it seems that ceph does not export the disk metrics it collects to prometheus in any way, will have...
[10:18:13] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716 (taavi) >>! In T348716#9279100, @dcaro wrote: > So it seems that ceph does not export the disk metrics it coll...
[10:18:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[10:26:03] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a ingress role in the toolsbeta cluster
[10:27:15] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-test-k8s-ingress-6
[10:27:22] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-test-k8s-ingress-6
[10:28:23] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the toolsbeta cluster
[10:32:02] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri)
[10:34:21] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a ingress role in the toolsbeta cluster
[11:01:05] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-test-k8s-ingress-6
[11:01:13] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-test-k8s-ingress-6
[11:01:21] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the toolsbeta cluster
[11:06:38] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (LucasWerkmeister) > The file has group write permissions Not as far as I can tell: `lang=shell-session lucaswerkmeister@tools-sgebastion-10:~$ ls -l /data/project/wdrc/public_html/api.php -rw-r--r-- 1 tool...
[11:09:43] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a ingress role in the toolsbeta cluster
[11:38:41] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:48:41] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:10:40] Toolforge (Toolforge iteration 01): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Slst2020) > This task is to figure out a way to be able cleanup those images If taking a batch approach to cleanup as @Raymond_Ndibe suggests, how often would this need to be done, r...
[12:29:53] Toolforge (Toolforge iteration 01): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (dcaro) I don't even think that we need to do any harbor downtime, the images that are affected by this are our own system-toolforge images, we can just disable momentarily the immuata...
[12:32:53] cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716 (dcaro) >>! In T348716#9279122, @taavi wrote: >>>! In T348716#9279100, @dcaro wrote: >> So it seems that ceph...
[12:35:14] Toolforge (Toolforge iteration 01): [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release ofa component happens - https://phabricator.wikimedia.org/T347392 (dcaro) Not really :) The idea is to automate it on gitlab side, instead of us having to run a script
[12:36:14] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Odd. Now it does. Same issue.
[12:38:07] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Other test case: /Users/mm6/php/magnustools/public_html/php/ToolforgeCommon.php
[12:40:15] Toolforge (Toolforge iteration 01): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (dcaro)
[12:40:47] Toolforge (Toolforge iteration 01), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi)
[12:41:42] Toolforge (Toolforge iteration 01): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Raymond_Ndibe) a:Raymond_Ndibe
[12:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:46:09] Toolforge: Display webservice logs when start fails - https://phabricator.wikimedia.org/T349703 (taavi)
[12:46:46] Toolforge: Display webservice logs when start fails - https://phabricator.wikimedia.org/T349703 (taavi) This was suggested by @Asaf today on IRC.
[12:46:55] Toolforge (Toolforge iteration 01), User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release ofa component happens - https://phabricator.wikimedia.org/T347392 (Raymond_Ndibe)
[12:47:45] Toolforge (Toolforge iteration 01): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (Slst2020) Hmm, how does the updated task description mesh with the wish expressed earlier in this discussion to not expose users to the "image" abstract...
[12:51:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Related documentation mentioned by @dcaro in IRC: https://mariadb.com/kb/en/mariadb-memory-allocation/
[12:51:48] Toolforge (Toolforge iteration 01): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (Slst2020) More generally, we might be reaching the limits of "winging it" in terms of UI/UX. I know you are working on consolidating our roadmap @dcaro....
[12:55:22] Toolforge (Toolforge iteration 01): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (dcaro) >>! In T341067#9279590, @Slst2020 wrote: > Hmm, how does the updated task description mesh with the wish expressed earlier in this discussion to...
[12:56:19] Toolforge (Toolforge iteration 01): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (dcaro) >>! In T341067#9279639, @dcaro wrote: >>>! In T341067#9279590, @Slst2020 wrote: >> Hmm, how does the updated task description mesh with the wish...
[12:58:41] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:23:41] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:27:17] Toolforge (Toolforge iteration 01), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi) a:taavi
[13:44:59] Tools: Wikidata ID link results in 504 Gateway Time-out - https://phabricator.wikimedia.org/T349712 (Aklapper) Open→Invalid Hi @Thosbsamsgom, thanks for taking the time to report this! The resulting page says: > This URI is managed by the wikidata-externalid-url tool, maintained by ArthurPSmith. Yo...
[13:51:37] (CephSlowOps) firing: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:51:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[13:53:57] Toolforge (Toolforge iteration 02): find an alternative to Vagrant - https://phabricator.wikimedia.org/T348960 (dcaro)
[13:54:00] Toolforge (Toolforge iteration 02): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (dcaro)
[13:54:02] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [webservice] Error shown when restarting buildpack-based tool - https://phabricator.wikimedia.org/T348312 (dcaro)
[13:54:04] Toolforge (Toolforge iteration 02): [tbs] migrate sample tools to Gitlab - https://phabricator.wikimedia.org/T348213 (dcaro)
[13:54:06] Toolforge (Toolforge iteration 02): decide on which kubernetes bootstrapper to focus on between minikube and kind - https://phabricator.wikimedia.org/T347723 (dcaro)
[13:54:08] Toolforge (Toolforge iteration 02), Documentation, Kubernetes: [buildservice] Add docs on how to run a ruby based tool using buildpacks - https://phabricator.wikimedia.org/T347402 (dcaro)
[13:54:10] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release ofa component happens - https://phabricator.wikimedia.org/T347392 (dcaro)
[13:54:12] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (dcaro)
[13:54:14] Toolforge (Toolforge iteration 02): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (dcaro)
[13:54:16] Toolforge (Toolforge iteration 02): `webservice restart` sometimes timing out for buildservice images - https://phabricator.wikimedia.org/T341057 (dcaro)
[13:54:20] Cloud Services Proposals, Toolforge (Toolforge iteration 02), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: [toolforge-envvars.api,toolforge-build.api] Support flagging environment variables to be injected at build time - https://phabricator.wikimedia.org/T338142 (dcaro)
[13:54:22] Toolforge (Toolforge iteration 02), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] Add triggering support - https://phabricator.wikimedia.org/T334587 (dcaro)
[13:54:26] Toolforge (Toolforge iteration 02), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: tbs: user-story 10: I want to know how to manage the service - https://phabricator.wikimedia.org/T325166 (dcaro)
[13:54:28] Toolforge (Toolforge iteration 02): Expose tool-labs service names via environment variables - https://phabricator.wikimedia.org/T151002 (dcaro)
[13:54:32] Toolforge (Toolforge iteration 02), Documentation, Kubernetes: Add a easy way to run a ruby webservice on tools - https://phabricator.wikimedia.org/T141388 (dcaro)
[13:54:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:54:48] Toolforge (Toolforge iteration 02): Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (dcaro)
[13:54:54] Toolforge (Toolforge iteration 02): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (dcaro)
[13:55:03] (InstanceDown) firing: Project tools instance tools-sgebastion-11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:55:18] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (dcaro)
[13:55:42] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (dcaro)
[13:55:48] Toolforge (Toolforge iteration 02), Patch-For-Review: [tbs.build.logs] Show a more user-friendly error message when logs are not ready - https://phabricator.wikimedia.org/T341059 (dcaro)
[13:56:01] Toolforge (Toolforge iteration 02), cloud-services-team, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] Automatically deploy the webservice when the image is built - https://phabricator.wikimedia.org/T341065 (dcaro)
[13:56:10] Toolforge (Toolforge iteration 02), Patch-For-Review, User-dcaro: `toolforge build logs`: add follow options - https://phabricator.wikimedia.org/T339922 (dcaro)
[13:56:19] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: toolforge build start: default to tailing the build as it progresses with the option of -d/--detached - https://phabricator.wikimedia.org/T340079 (dcaro)
[13:56:21] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, Patch-For-Review, User-dcaro: [builds-api] catch harbor timeout when creating repository - https://phabricator.wikimedia.org/T345903 (dcaro)
[13:56:33] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (dcaro)
[13:57:03] (InstanceDown) firing: Project cloudinfra instance cloud-puppetmaster-05 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:57:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-puppetdb-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:57:15] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [builds-api.start] Add statistics - https://phabricator.wikimedia.org/T337390 (dcaro)
[14:00:03] (InstanceDown) firing: (2) Project tools instance tools-redis-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:04:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:05:03] (InstanceDown) resolved: (2) Project tools instance tools-redis-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:06:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance k8s-test-nfs in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:06:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 11 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:07:03] (InstanceDown) resolved: Project cloudinfra instance cloud-puppetmaster-05 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:07:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-puppetdb-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:01:27] Toolforge (Quota-requests): Request increased quota for Montage Toolforge tool - https://phabricator.wikimedia.org/T348894 (taavi) >>! In T348894#9269025, @fnegri wrote: >> I'd definitely be interested in increasing resources > > @taavi can I get your +1 on doubling the CPU and Memory quotas for this tool?...
[15:15:01] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) a:fnegri
[15:15:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) p:Triage→High
[15:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:03:16] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [webservice shell] Allow a user to delete/stop all running shell pods - https://phabricator.wikimedia.org/T349733 (dcaro) p:Triage→Medium
[16:11:01] Cloud-Services: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (cmooney) p:Triage→Medium The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific proj...
[16:11:53] Cloud Services Proposals: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (cmooney)
[16:12:41] Cloud-VPS, cloud-services-team: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (taavi)
[16:17:34] Cloud-VPS, cloud-services-team: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (cmooney)
[16:19:10] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi) The above patches make it possible to provision a new host on bookworm. There are a couple of issues...
[16:56:32] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) @Jclark-ctr hi! Yes, we can schedule, though we can't take many hosts at the same time, so will have to be done little by litt...
[17:03:41] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:14:06] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (LucasWerkmeister) >>! In T349687#9279528, @Magnus wrote: > Other test case: /Users/mm6/php/magnustools/public_html/php/ToolforgeCommon.php Also not group-readable: `lang=shell-session lucaswerkmeister@tool...
[17:15:00] 10Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (10LucasWerkmeister) wait, wrong command 🤦 [17:16:19] 10Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (10LucasWerkmeister) `lang=shell-session tools.wdrc@tools-sgebastion-10:~$ { ls -l /data/project/wdrc/public_html/api.php; while sleep 1m; do ls -l /data/project/wdrc/public_html/api.php; done; } | tee T349687-... [18:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:00:11] 10Toolforge (Toolforge iteration 01), 10cloud-services-team: Weird error HTTP 405 Method Not Allowed on Toolforge - https://phabricator.wikimedia.org/T349452 (10Albertoleoncio) 05Resolved→03Open [19:02:14] 10Cloud-VPS, 10cloud-services-team: Cloud-hosts connected at 1G - https://phabricator.wikimedia.org/T349735 (10Andrew) As far as I know there's no reason at all that these have 1G connections other than history and laziness. The only reason it would matter is if they literally don't have 10G nics which I doubt. [19:26:51] 10Cloud-VPS, 10cloud-services-team, 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro that works for me let me know what 5 are available tomorrow and i will start. Thanks! [20:16:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [21:03:41] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:13:41] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:15:46] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (Ladsgroup) Thanks. From my contacts, this seems to be fixed now. [21:44:50] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:52:46] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Andrew) [21:53:30] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Andrew) These hosts have four drives will be fine with just one SW raid, so "partman/standard.cfg partman/raid10-4dev.cfg" look... [21:53:42] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:54:20] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Andrew) [21:54:41] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Andrew) Note that I also changed the distro to Bookworm. We're currently upgrading all our existing hosts to Bookworm. [22:36:42] Toolforge Jobs framework: Internal server error from "toolforge jobs logs" - https://phabricator.wikimedia.org/T349775 (bjh21) [23:16:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [23:40:28] Tool-Pageviews, Data-Engineering, Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (KAP_Jasa) thank you all :-)