[00:16:28] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 47 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:21:15] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720987 (Papaul)
[00:21:28] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:23:42] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9721003 (Papaul) @Jhancock.wm anything else left to be done on this task?
[00:23:43] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720988 (Papaul) Open→Resolved Complete
[01:00:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt2004-dev.codfw.wmnet' (T356287)
[01:00:52] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287
[01:05:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt2004-dev.codfw.wmnet' (T356287)
[01:35:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt2005-dev.codfw.wmnet' (T356287)
[01:35:37] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287
[01:40:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt2005-dev.codfw.wmnet' (T356287)
[01:43:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt2006-dev.codfw.wmnet' (T356287)
[01:43:18] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287
[01:48:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt2006-dev.codfw.wmnet' (T356287)
[01:48:18] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287
[01:53:02] Tools: AFD stats update delay - https://phabricator.wikimedia.org/T362732#9721105 (Bugreporter)
[01:54:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[01:58:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[03:24:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 50 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:29:03] Tools: AFD stats update delay - https://phabricator.wikimedia.org/T362732#9721333 (Aklapper) @Liz: Quarry and bots are not AFD?
[06:37:15] Data-Services: AFD stats update delay - https://phabricator.wikimedia.org/T362732#9721345 (JJMC89) This is expected due to replica database lag from T352010. See https://replag.toolforge.org.
[06:47:08] Tools, Community-Wishlist-Survey-2023, Wikimedia Wishathon: Investigate Dabfix tool implementation - https://phabricator.wikimedia.org/T336545#9721392 (srishakatux) Updates on the progress made during Wishathon (March 2024) by @Soda and @Gopavasanth : https://meta.wikimedia.org/wiki/Community_Tech/W...
[06:51:22] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[07:15:42] (PS1) Muehlenhoff: Remove obsolete stub cert [labs/private] - https://gerrit.wikimedia.org/r/1020624 (https://phabricator.wikimedia.org/T360636)
[07:18:27] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete stub cert [labs/private] - https://gerrit.wikimedia.org/r/1020624 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[07:24:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:49:16] Toolforge (Toolforge iteration 09): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9721502 (dcaro)
[07:49:37] Toolforge (Toolforge iteration 09): [builds-builder,jobs-api,upstream] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#9721506 (dcaro)
[07:49:38] cloud-services-team, Toolforge (Toolforge iteration 09): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9721508 (dcaro)
[07:49:48] Toolforge (Toolforge iteration 09), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [maintain-harbor,docs] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176#9721504 (dcaro)
[07:49:51] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09): [builds-api] Add dashboards with the new statistics - https://phabricator.wikimedia.org/T352764#9721510 (dcaro)
[07:50:37] Toolforge (Toolforge iteration 09): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9721512 (dcaro)
[07:50:48] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#9721520 (dcaro)
[07:50:49] Toolforge (Toolforge iteration 09): [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation - https://phabricator.wikimedia.org/T356261#9721518 (dcaro)
[07:51:03] Toolforge (Toolforge iteration 09), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#9721524 (dcaro)
[07:51:07] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721526 (dcaro)
[07:51:15] Toolforge (Toolforge iteration 09), Documentation: [harbor,docs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092#9721527 (dcaro)
[07:51:19] Toolforge (Toolforge iteration 09): [maintain-kubeusers] Increment default services quota - https://phabricator.wikimedia.org/T362520#9721528 (dcaro)
[07:51:24] Toolforge (Toolforge iteration 09): remove "File log:" column from toolforge jobs list -o long output - https://phabricator.wikimedia.org/T361896#9721529 (dcaro)
[07:51:28] Toolforge (Toolforge iteration 09): [jobs-cli,jobs-api] quota shows different units for limit and usage - https://phabricator.wikimedia.org/T361120#9721530 (dcaro)
[07:51:31] Toolforge (Toolforge iteration 09): [maintain-harbor] Have maintain-harbor use a robot account - https://phabricator.wikimedia.org/T361698#9721531 (dcaro)
[07:51:35] Toolforge (Toolforge iteration 09): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9721532 (dcaro)
[07:51:41] Toolforge (Toolforge iteration 09): [toolforge] simplify calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377#9721533 (dcaro)
[07:51:47] Toolforge (Toolforge iteration 09), Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9721534 (dcaro)
[07:51:52] Toolforge (Toolforge iteration 09), Patch-For-Review: [api-gateway] Add a python server to serve consolidated openapi docs - https://phabricator.wikimedia.org/T362299#9721514 (dcaro)
[07:51:54] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09): [docs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313#9721535 (dcaro)
[07:51:59] Toolforge (Toolforge iteration 09), Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417#9721536 (dcaro)
[07:52:03] Toolforge (Toolforge iteration 09), Patch-For-Review: [jobs-api,buildservice-api,envvars-api] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745#9721516 (dcaro)
[07:52:25] cloud-services-team, Toolforge (Toolforge iteration 09), Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9721522 (dcaro)
[08:06:11] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721597 (dcaro) Open→In progress
[08:15:21] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721617 (dcaro) Things I've tried: === Yesterday === Remount the nfs volume: ` root@tools-k8s-worker-nfs-1:~# mount -o remount /mnt/nfs/labstore-secondary-tools-hom...
[08:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 51 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:23:53] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721628 (dcaro) More tests: ` root@tools-k8s-worker-nfs-1:~# ls -l /mnt/nfs/labstore-secondary-tools-home/ & -> works (takes ~25s) root@tools-k8s-worker-nfs-1:~# l...
[08:27:14] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721648 (dcaro) On a new mount with `intr` option added, I get the same behavior (`intr` does not show up later when showing the mounts though, so not sure it's being...
[08:37:42] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9721667 (dcaro) Mounting it as soft: ` root@tools-k8s-worker-nfs-1:~# mount -o rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,ret...
[08:45:32] (PS1) SimmeD: Updated the code a bit. [labs/tools/dawiki] - https://gerrit.wikimedia.org/r/1020708
[08:47:49] (PS2) SimmeD: Updated the code a bit. [labs/tools/dawiki] - https://gerrit.wikimedia.org/r/1020708
[08:50:14] (CR) SimmeD: [V:+2 C:+2] "LGTM" [labs/tools/dawiki] - https://gerrit.wikimedia.org/r/1020708 (owner: SimmeD)
[09:21:05] cloud-services-team, Toolforge: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050#9721793 (aborrero) Open→In progress p:Triage→Medium
[10:19:03] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-1 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:32:57] cloud-services-team, Toolforge (Toolforge iteration 09): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9721943 (dcaro) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020767 to disable file buffering for responses, let's see how...
[10:34:26] cloud-services-team, Toolforge (Toolforge iteration 09): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9721951 (dcaro) btw. an lsof showed that the files were deleted already: ` root@proxy-03:/etc/nginx# lsof -n | grep var/lib/nginx...
[10:44:52] cloud-services-team, Striker, Data-Persistence-Backup, DBA, Patch-For-Review: Create a database for Striker test instance - https://phabricator.wikimedia.org/T360149#9721970 (ABran-WMF) @taavi you can replace all occurrences of striker_toolsbeta by strikertoolsbeta (i.e. `striker_toolsbeta_ad...
[10:51:22] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:53:01] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:53:40] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:05:13] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9722023 (dcaro) I rebooted the node, we'll have to do some extra testing :/
[11:23:33] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-1 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:24:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:13:20] Toolforge (Toolforge iteration 09): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9722249 (dcaro) @aborrero pointed out that this might be a consequence of the OOM killer killing a process at a bad moment. OOM killer is the standard way that cgrou...
[12:17:42] (CloudVPSDesignateLeaks) firing: (3) Detected 50 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:06:01] cloud-services-team, Toolforge, Patch-For-Review: review pod templates for stricter security - https://phabricator.wikimedia.org/T362050#9722329 (CodeReviewBot) aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/75 Draft: jobs-api: introduce securityContext...
[13:17:07] (PS1) Ladsgroup: Setting dummy password for cumin dedicated mysql user [labs/private] - https://gerrit.wikimedia.org/r/1020828
[13:25:41] (CR) Ladsgroup: [C:+1] Setting dummy password for cumin dedicated mysql user [labs/private] - https://gerrit.wikimedia.org/r/1020828 (owner: Ladsgroup)
[13:25:53] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:26:10] (CR) Ladsgroup: [V:+2 C:+2] Setting dummy password for cumin dedicated mysql user [labs/private] - https://gerrit.wikimedia.org/r/1020828 (owner: Ladsgroup)
[13:29:24] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE, Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9722407 (Andrew) Open→Resolved These are now in service and working fine.
[14:09:54] cloud-services-team, Toolforge (Toolforge iteration 09): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9722565 (dcaro) I'm declaring this a win, it has not used any space after the change: ` root@proxy-03:/etc/nginx# lsof -n | grep var/lib/n...
[14:10:10] cloud-services-team, Toolforge (Toolforge iteration 09), Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9722561 (dcaro) In progress→Stalled
[14:18:54] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Patch-For-Review: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287#9722606 (Andrew) codfw1dev is now running bobcat. The only (minor) issue I'm aware of so far is T350807
[14:26:39] (ProbeDown) firing: (3) Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:30:26] Data-Services: AFD stats update delay - https://phabricator.wikimedia.org/T362732#9722665 (TheTechie) It seems the entirety of Xtools is struggling from this. My Mainspace edit count on Xtools isn't updating either.
[14:31:39] (ProbeDown) resolved: (4) Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:39:10] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9722693 (Ahecht)
[14:40:42] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9722696 (Ahecht) I updated the description. Any tools that rely on database replicas, including all toolforge tools that rely on data not available through the API, are affected by this.
[14:40:49] cloud-services-team, VPS-project-devtools, collaboration-services, Patch-For-Review, and 2 others: Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9722698 (Dzahn) {F47170916} ^ afraid it's not stable yet. seems down again.
[14:47:09] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9722706 (TheTechie)
[14:47:38] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9722708 (TheTechie) Updated it again as some info that I previously had was redundant.
[14:53:50] (CloudVPSDesignateLeaks) firing: (3) Detected 58 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:53:50] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[14:55:23] (CloudVPSDesignateLeaks) resolved: (3) Detected 58 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:58:34] Toolforge (Software install/update), Mismatch Finder, Wikidata, Patch-For-Review: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679#9722761 (taavi) Open→Resolved a:taavi Done. (Although I wonder if you could use [[ https://wikitech.wikimedia.org/wiki...
[15:08:50] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:17:21] Toolforge (Software install/update), Mismatch Finder, Wikidata, Patch-For-Review: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679#9722868 (Lucas_Werkmeister_WMDE) Yeah, it feels like the sort of tool it could be quite useful for.
[15:20:37] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:29:33] Toolforge (Toolforge iteration 09): [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093#9723016 (dcaro) In progress→Resolved
[15:29:45] cloud-services-team, Toolforge (Toolforge iteration 09): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9723013 (dcaro) In progress→Resolved
[16:12:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:22:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:26:11] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9723453 (Liz) >>! In T362732#9722693, @Ahecht wrote: > I updated the description. Any tools that rely on database replicas, including all toolforge tools that rely on data not available through the API,...
[16:27:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:31:16] (ProbeDown) firing: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:36:16] (ProbeDown) resolved: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:44:36] Toolforge, Wikimedia-Hackathon-2024: Build a tool (or tools) to easily visualize DP datasets - https://phabricator.wikimedia.org/T362805#9723550 (Htriedman)
[16:51:37] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09): [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9723581 (fnegri) In progress→Resolved This change was released in toolforge-jobs-framework-cli 16.0.8, which is now deployed on all tools a...
[17:13:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:19:13] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9723669 (TheTechie) And now 24
[17:28:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:30:53] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:33:59] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9723706 (taavi) The lag will go up until the schema changes (linked above) finish applying, and after that it will go back down. Commenting about the lag here (or doing anything else at this point, reall...
[18:16:35] Toolforge: Cannot connect to dev.toolforge.org using Mosh with custom locale - https://phabricator.wikimedia.org/T362680#9723890 (LucasWerkmeister) After today’s reimage, this affects login.toolforge.org as well; I assume it’s also the reason I’m now seeing broken UTF-8 in my commit messages: {F47199359} (...
[18:25:39] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9723928 (LucasWerkmeister)
[18:28:50] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9723940 (LucasWerkmeister) @taavi since you mentioned cultural imperialism in that commit message… this is pretty meh :/ {F47200922}
[18:29:12] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9723954 (LucasWerkmeister)
[18:37:54] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9723978 (taavi) The main thing I'm confused here is that why are the locales not breaking on any other instance. The previous bastions having all the locales installed was due t...
[18:43:39] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9724001 (LucasWerkmeister) It happens there as well, yes. {F47203088} It’s probably not new in general, but it sucks that it now affects the Toolforge bastions, and the Mosh pro...
[19:08:50] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:12:59] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9724073 (taavi) Ok, thanks. After giving this a bit more thought: * This really should work out of the box. I feel like something very cursed is happening on these instances whe...
[19:16:16] Toolforge: Toolforge is missing locales (breaks Mosh and causes various other problems) - https://phabricator.wikimedia.org/T362680#9724090 (LucasWerkmeister) Sounds good to me; though I’m also curious what other “shared SSH”-style environments do (e.g. if the `AcceptEnv` adjustment is common or not). Sadly...
[19:32:16] Cloud-Services: Slow loading on Toolforge - https://phabricator.wikimedia.org/T362822 (GPSLeo) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this tas...
[19:35:58] Toolforge: Slow loading on Toolforge - https://phabricator.wikimedia.org/T362822#9724179 (JJMC89)
[19:36:54] Cloud Services Proposals, Toolforge: Slow loading on Toolforge - https://phabricator.wikimedia.org/T362822#9724183 (GPSLeo)
[19:43:04] Toolforge: Slow loading on Toolforge - https://phabricator.wikimedia.org/T362822#9724192 (taavi)
[19:50:11] Tool-Global-user-contributions, Stewards-and-global-tools, Temporary accounts, XTools, and 2 others: [Design] Update wireframes with user testing learnings - https://phabricator.wikimedia.org/T359827#9724199 (KColeman-WMF)
[19:54:33] Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9724217 (TheDJ) And also a reminder that this is expected every once in a while. It even has its own page https://en.wikipedia.org/wiki/Wikipedia:Replication_lag
[20:10:03] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:20:03] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:20:18] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:21:33] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:25:18] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:30:18] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-50 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:42:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:47:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:48:28] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-50
[20:49:13] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-50
[20:52:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:57:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 1 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:14:06] 10Toolforge: Slow loading on Toolforge - https://phabricator.wikimedia.org/T362822#9724475 (10dcaro) @GPSLeo do you have a specific link that we can test? [21:27:38] 10Cloud-VPS (Debian Buster Deprecation), 06collaboration-services, 13Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9724503 (10Dzahn) puppetmaster-1001 shut down but not deleted just yet [21:35:53] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:49:07] 10Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9724561 (10Nintendofan885) a:05Nintendofan885→03None [21:50:19] 10Data-Services: enwiki_p database replica has stopped updating - https://phabricator.wikimedia.org/T362732#9724558 (10Nintendofan885) 05Open→03Resolved a:03Nintendofan885 Looks like it's finally caught up now [22:06:11] 06cloud-services-team, 10Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9724584 (10Anomie) >>! In T360488#9677736, @bd808 wrote: > Dropping this into the "needs discussion" column for #cloud-services-team as a... [22:08:07] 10Toolforge: Run non-interactive commands on Toolforge kubernetes webservices - https://phabricator.wikimedia.org/T169695#9724585 (10bd808) 05Open→03Resolved a:03bd808 https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework does this work now. 
[22:12:52] 06cloud-services-team, 10Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9724595 (10bd808) >>! In T360488#9724584, @Anomie wrote: >>>! In T360488#9677736, @bd808 wrote: >> Dropping this into the "needs discussio... [22:19:05] 06cloud-services-team, 10Toolforge (Software install/update): Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#9724604 (10Anomie) I can't say that having to change various references to login.toolforge.org in my stuff to login-buster.toolforge.org (... [22:38:50] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:43:48] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:44:03] 10Toolforge: [builds-builder] Add support for specifying multiple buildpacks to run against a repo - https://phabricator.wikimedia.org/T362834 (10bd808) 03NEW [22:44:05] 10Toolforge: [builds-builder] Add support for specifying multiple buildpacks to run against a repo - https://phabricator.wikimedia.org/T362834#9724693 (10bd808) As explained briefly in the irc snippet, I have a use case where #tool-bridgebot uses a 3rd party golang component ([[https://github.com/42wim/matterbri... 
[22:53:48] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:58:48] (PuppetConstantChange) firing: (3) Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:14:37] 10Toolforge: Provide a simple list of the built container images for a given tool (`toolforge build list` subset) - https://phabricator.wikimedia.org/T362836 (10bd808) 03NEW