[00:02:32] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (Raymond_Ndibe) >>! In T353740#9429362, @dcaro wrote: >> After a patch is merged into master, It doesn't seem to make sense to run the te...
[00:11:09] (CR) BryanDavis: [C: +2] phabricator: Allow setting source repository project field (1 comment) [labs/striker] - https://gerrit.wikimedia.org/r/992157 (owner: Majavah)
[00:12:51] (Merged) jenkins-bot: phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157 (owner: Majavah)
[00:14:02] (Abandoned) BryanDavis: Set code repository URI when creating project tags [labs/striker] - https://gerrit.wikimedia.org/r/971912 (https://phabricator.wikimedia.org/T320915) (owner: Aklapper)
[00:43:03] (PS1) BryanDavis: repo: Tweak label and help for toolinfo inclusion [labs/striker] - https://gerrit.wikimedia.org/r/997989
[00:55:12] (CR) BryanDavis: [C: +2] repo: Tweak label and help for toolinfo inclusion [labs/striker] - https://gerrit.wikimedia.org/r/997989 (owner: BryanDavis)
[00:56:54] (Merged) jenkins-bot: repo: Tweak label and help for toolinfo inclusion [labs/striker] - https://gerrit.wikimedia.org/r/997989 (owner: BryanDavis)
[01:09:19] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/71 [builds-ap...
[01:20:52] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolfo...
[01:21:08] Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (KHurd-WMF)
[01:54:56] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/191 bui...
[04:01:12] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (tstarling) I can work on this if someone makes me a maintainer. It'll be the same as what I did for panoviewer (except hopefully less complicated).
[04:11:13] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] add quota information to NewBuild struct - https://phabricator.wikimedia.org/T356724 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-...
[05:11:35] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (Raymond_Ndibe)
[05:16:41] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge...
[06:06:29] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge...
[08:00:23] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:02:10] (CR) Majavah: phabricator: Offer to set issue tracker URL in toolinfo (1 comment) [labs/striker] - https://gerrit.wikimedia.org/r/992146 (owner: Majavah)
[08:05:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:27:03] Toolforge: Do not deprecate python versions on the toolforge jobs framework that are the default version on toolforge - https://phabricator.wikimedia.org/T356582 (taavi) Open→Declined Per JJMC. You should be using `webservice shell` or a one-off job instead of working on a venv on the bastion...
[08:39:25] (PS1) Eugene233: Merge m2c branch to main [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772)
[08:39:49] (CR) CI reject: [V: -1] Merge m2c branch to main [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772) (owner: Eugene233)
[08:42:53] (PS2) Eugene233: Merge m2c branch to main [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772)
[08:55:49] Cloud-VPS, cloud-services-team: Do not NAT traffic to cloud-private - https://phabricator.wikimedia.org/T356850 (taavi)
[09:26:14] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (Magnus) This issue is persisting for the listeria tool. The bot is now down from ~40K edits/day to ~500. Something needs to be done, soon.
[10:10:51] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (dcaro) p:Triage→High
[10:10:54] Toolforge (Toolforge iteration 05), Toolforge Build Service, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (dcaro)
[10:10:56] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro) p:Triage→Medium
[10:11:02] Toolforge (Toolforge iteration 05), Epic: Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262 (dcaro) p:Triage→High
[10:11:09] Toolforge (Toolforge iteration 05): [Toolforge CLI consolidation] Explore OpenAPI tooling - https://phabricator.wikimedia.org/T356261 (dcaro) p:Triage→High
[10:11:22] Toolforge (Toolforge iteration 05): [toolforge-cd] 
discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (dcaro) p:Triage→Medium
[10:11:48] Toolforge (Toolforge iteration 05), Toolforge Build Service: [maintain-harbor] Improvements to subcommands and config validation - https://phabricator.wikimedia.org/T353059 (dcaro) p:Triage→Medium
[10:12:05] Toolforge (Toolforge iteration 05), Toolforge Build Service: [tbs] cleanup robot account related code - https://phabricator.wikimedia.org/T352763 (dcaro) p:Triage→High
[10:12:07] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (dcaro) Open→In progress
[10:12:13] Toolforge (Toolforge iteration 05), Toolforge Build Service, User-Raymond_Ndibe: [tbs] Give a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (dcaro) p:Triage→High
[10:12:17] Toolforge (Toolforge iteration 05), Toolforge Build Service: [tbs] cleanup robot account related code - https://phabricator.wikimedia.org/T352763 (dcaro) p:High→Medium
[10:12:29] Toolforge (Toolforge iteration 05), Toolforge Build Service, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (dcaro) p:Triage→High
[10:12:39] Toolforge (Toolforge iteration 05), Toolforge Build Service: [tbs][builds-api] Refactor `internal/builds.go` - https://phabricator.wikimedia.org/T352762 (dcaro) p:Triage→Medium
[10:12:44] Toolforge (Toolforge iteration 05), User-Raymond_Ndibe: [toolforge-cd] find out why we run two gitlab ci/cd pipelines after merge - https://phabricator.wikimedia.org/T353563 (dcaro) p:Triage→Medium
[10:12:51] Toolforge (Toolforge iteration 05): 
[toolforge-cd] gitlab-ci refactor - https://phabricator.wikimedia.org/T353514 (dcaro) p:Triage→Medium
[10:12:58] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (dcaro) p:Triage→Medium
[10:13:07] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) p:Triage→High
[10:13:12] Toolforge (Toolforge iteration 05), Toolforge Build Service, User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (dcaro) p:Triage→High
[10:13:25] Toolforge (Toolforge iteration 05), Toolforge Build Service, Patch-For-Review, Upstream: [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417 (dcaro) p:Triage→High
[10:13:28] Toolforge (Toolforge iteration 05), Toolforge Build Service: [harbor] upgrade to 2.10.x - https://phabricator.wikimedia.org/T354507 (dcaro) p:Triage→High
[10:13:35] Toolforge (Toolforge iteration 05), Toolforge Build Service, cloud-services-team, Cloud-Services-Origin-Team, and 2 others: [builds-api] Automatically deploy the webservice when the image is built - https://phabricator.wikimedia.org/T341065 (dcaro) p:Triage→High
[10:26:09] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) @Magnus I have been unable to reproduce myself using a simple script (just curl to the url mentioned in this task) from the same node that was failing before. Can I use your j...
[10:27:54] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) I have found that it happens also on old nodes, for example, rustbot (from the listeria tool): ` tools.listeria...
[10:29:14] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) Hmpf... it was just killed by OOM: ` [Wed Feb 7 10:26:56 2024] Memory cgroup out of memory: Kill process 2443...
[10:38:48] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) I was able to run a pcap on a full request: ` tools.listeria@rustbot-6548cb7b94-749wm:~$ curl -v 'https://commo...
[10:42:03] (PS1) Juniorbesong: Bug: T [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998331
[10:42:05] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998331 (owner: Juniorbesong)
[10:42:54] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) Mtr from an affected worker (43) and a non-affected one (44) look the same: ` ### 43 root@tools-k8s-worker-43:~...
[10:46:22] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (Magnus) Please use the bot as you like for testing! "rustbot" is just the job name for the listeria bot. I invoke it like so: ` toolforge-jobs run --image tf-php74 --mem 5000Mi --cp...
[11:35:29] Toolforge (Toolforge iteration 05): [toolforge-cd] discuss the possibility of removing tests from merge request ci/cd pipelines - https://phabricator.wikimedia.org/T353740 (dcaro) >>! In T353740#9519630, @Raymond_Ndibe wrote: >>>! In T353740#9429362, @dcaro wrote: >>> After a patch is merged into master, It...
[11:35:31] Grid-Engine-to-K8s-Migration: Migrate articles-by-lat-lon-without-images from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319573 (Abbe98) Thank you @dcaro! I will give it a try over the weekend.
[11:40:54] Grid-Engine-to-K8s-Migration: Migrate map-search from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319875 (Abbe98) Thank you @dcaro, I will however migrate this tool off Toolforge and just put up a static "this tool has moved page" to catch people accessing the old urls.
[11:44:29] Grid-Engine-to-K8s-Migration: Migrate wmf-sitematrix from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320180 (Abbe98) I'm not entirely sure how to migrate this one as the jobs part was implemented by WMF staff. I'm not sure how much it's in use today but in the past vari...
[11:52:22] cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338 (taavi) Stalled→Open
[11:52:29] cloud-services-team (FY2022/2023-Q4), Goal, Patch-For-Review, User-aborrero: eqiad1: allocate public IPv4 CIDR for BGP-based virtual IP addresses - https://phabricator.wikimedia.org/T341220 (taavi)
[11:59:43] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) So it seems that the bots are triggering the silent-drop limits on the haproxies: ` vgutierrez> Valentin Gutier...
[12:25:48] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) So update on this, it seems that we are hitting the 500 concurrent connections from a single IP (the worker node). From this, I've noticed that the rustbot process is opening...
[12:35:40] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (Magnus) Thanks to the recently relaxed limitations per job on Toolforge, I am running this with more async threads, so without actual code change, it would have increased the number o...
[12:43:20] Toolforge, cloud-services-team, Patch-For-Review, User-aborrero: kubelet: cannot work with docker >= v25 - https://phabricator.wikimedia.org/T356629 (aborrero) Open→Resolved a:aborrero We did exactly as @taavi suggested.
[13:01:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:05:09] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge API] Investigate ways to present our openapi definitions to users - https://phabricator.wikimedia.org/T354745 (aborrero) So far, the only thing I could find online is this: https://github.com/Trax-air/swagger-aggregator which seems a bit old (...
[13:06:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:16:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:21:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:25:33] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, User-aborrero: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338 (taavi)
[13:32:11] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (dcaro) > Is the connection limit per-wiki or for all wikis together? It's per CDN node unfortunately, that means that wikis get aggregated. > I will look into limiting this, and the...
[13:34:02] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro) So yes, we have a strong limit on the CDN layer of 500 concurrent connections per-ip, and each worker node has...
[13:48:01] Toolforge: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163 (dcaro) @Joe thanks! Yes, the issue is unrelated to the k8s workers, we were just hitting the limit of concurrent connections to the CDN per-ip. In that sense, @Leloiandudu have you had any issues l...
[13:48:51] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge API] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client - https://phabricator.wikimedia.org/T354745 (aborrero)
[13:51:34] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [nova-api,cloudrabbit] Connectivity issues from all cloudcontrols to all cloudrabbit nodes - https://phabricator.wikimedia.org/T356621 (dcaro) Unfortunately this did not seem to he...
[13:53:12] Toolforge (Toolforge iteration 05), User-aborrero: [toolforge] several tools get periods of connection refused (104) when connecting to wikis - https://phabricator.wikimedia.org/T356164 (dcaro)
[13:55:31] Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (fnegri) @KHurd-WMF if you can make OpenVAS listen to requests on port 80 (which seems it should be doing by default), setting up the web proxy to forward requests to it should be straightforward....
[14:06:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:12:32] Toolforge: Listeria bot sometimes gets stuck with 104 errors from Wikimedia APIs - https://phabricator.wikimedia.org/T356160 (Magnus) I have limited it to 50 API connections at a time. Still throwing 104 errors. Either I did it wrong, or there is some other issue.
[14:16:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:17:24] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) @dcaro The same response was sent for each case please advise how you would like me to proceed. De...
[14:47:05] Cloud-VPS, cloud-services-team, Patch-For-Review: Do not NAT traffic to cloud-private - https://phabricator.wikimedia.org/T356850 (taavi) p:Triage→Medium
[15:04:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:30] Grid-Engine-to-K8s-Migration, Growth-Team: Migrate ERANBOT project off of Grid Engine - https://phabricator.wikimedia.org/T306888 (komla) Noted
[15:09:17] Grid-Engine-to-K8s-Migration: Migrate wmf-sitematrix from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320180 (komla) @Abbe98 I will move this from the 'Backlog' to 'Help Wanted' for now.
[15:09:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:10:57] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:15:00] Data-Services, cloud-services-team, Data-Platform, Patch-For-Review: Add global_edit_count to wikireplicas - https://phabricator.wikimedia.org/T344108 (BTullis) @lbowmaker, @WDoranWMF, @Ahoelzl - Would you be able to help us to define the procedure here please? We have a new request for a change...
[15:20:11] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:20:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:25:23] (PS1) Juniorbesong: BUG: T320500 modified isa/campaigns/image_updater.py [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998432
[15:37:09] (CephSlowOps) firing: Ceph cluster in eqiad has 16 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[15:37:24] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[15:40:59] Cloud-VPS (Quota-requests): Floating IP request for project Openvas - https://phabricator.wikimedia.org/T356830 (bd808) >>! 
In T356830#9520964, @fnegri wrote: > @KHurd-WMF if you can make OpenVAS listen to requests on port 80 (which seems it should be doing by default), setting up the web proxy to forward re...
[15:41:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:44:28] (InstanceDown) firing: Project cloudinfra instance enc-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:44:28] (InstanceDown) firing: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:44:55] (PawsJupyterHubDown) firing: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[15:45:28] (InstanceDown) firing: (3) Project tools instance tools-k8s-worker-69 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:45:28] (InstanceDown) firing: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:45:28] (InstanceDown) firing: Project toolsbeta instance toolsbeta-sgecron-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:45:49] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 340 bytes in 60.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:45:51] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:46:28] (InstanceDown) firing: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:46:28] (WidespreadInstanceDown) firing: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[15:46:57] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/cron - 340 bytes in 60.013 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:49:07] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 40.652 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:49:09] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 37.775 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[15:49:55] (PawsJupyterHubDown) resolved: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[15:50:28] (InstanceDown) resolved: (2) Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:50:28] (InstanceDown) resolved: (2) Project toolsbeta instance toolsbeta-sgecron-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:50:28] (InstanceDown) resolved: (9) Project tools instance tools-k8s-worker-53 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:50:51] (ProbeDown) firing: (5) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:51:10] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:51:28] (InstanceDown) resolved: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:51:28] (WidespreadInstanceDown) resolved: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[15:52:10] (CephSlowOps) resolved: Ceph cluster in eqiad has 256 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[15:54:28] (InstanceDown) resolved: Project cloudinfra instance enc-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:54:28] (InstanceDown) resolved: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:04:06] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers
[16:08:58] (InstanceDown) firing: (10) Project tools instance tools-k8s-worker-53 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:13:58] (InstanceDown) resolved: (3) Project tools instance tools-k8s-worker-nfs-9 
is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:46:37] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:51:37] (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:55:51] (ProbeDown) firing: (3) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:00:58] (InstanceDown) firing: (3) Project tools instance tools-k8s-worker-98 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:01:51] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for all workers
[17:02:39] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers
[17:03:46] !log taavi@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for all workers
[17:05:09] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers
[17:05:40] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for all workers
[17:05:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-98 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:23:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers
[17:24:32] !log taavi@cloudcumin1001 
tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for all workers [17:35:51] (ProbeDown) firing: (3) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:51:31] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-30.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [17:58:06] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9 [17:58:52] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-9 [18:00:05] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers [18:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [18:12:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-nfs-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:17:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-nfs-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:21:43] (PS1) Andrew Bogott: k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 [18:22:43] (CR) BryanDavis: [V: +1 C: +2] phabricator: Offer to set issue tracker URL in toolinfo (1 comment) [labs/striker] - https://gerrit.wikimedia.org/r/992146 (owner: 
Majavah) [18:25:32] (CR) CI reject: [V: -1] k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott) [18:31:28] VPS-project-Wikistats, User-RhinosF1: Wikistats is using a malformed user agent - https://phabricator.wikimedia.org/T354101 (Dzahn) I have deployed the permanent change to the user agent and running some updates. We will have to keep an eye on how many wikis may not return results anymore as before. [18:33:25] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) [18:34:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-94 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:36:51] Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (Dvorapa) [18:38:33] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) Around the time I first tried to delete this snapshot (today around 14:45 UTC), we started having issues on the Ceph cluster: {T334240} This might or mig... 
[18:39:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240 (fnegri) [18:39:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-94 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:42:38] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) [18:43:26] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) a:fnegri [18:43:33] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) p:Triage→High [18:43:48] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717 (fnegri) [18:44:15] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) Open→In progress [18:44:35] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) [18:46:54] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) [18:49:29] Grid-Engine-to-K8s-Migration: Migrate kmlexport from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T356905 (dcaro) Currently there's no easy way to run toolforge commands from within k8s. 
It would be relatively easy to add health probes to the webservices though, and get th... [18:50:51] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:51:45] Toolforge (Toolforge iteration 05): [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (dcaro) [18:51:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-90 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:56:51] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:56:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-90 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:01:35] (PS1) Eevans: (faux) keys & certs for sessionstore200[4-6] [labs/private] - https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829) [19:01:37] (PS1) Eevans: cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - https://gerrit.wikimedia.org/r/998505 [19:06:51] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:07:17] (CR) Eevans: [V: +2 C: +2] (faux) keys & certs for 
sessionstore200[4-6] [labs/private] - https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829) (owner: Eevans) [19:08:11] (CR) Eevans: [V: +2 C: +2] cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - https://gerrit.wikimedia.org/r/998505 (owner: Eevans) [19:08:58] (InstanceDown) firing: (3) Project tools instance tools-k8s-worker-88 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:13:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-88 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:18:58] (InstanceDown) firing: (3) Project tools instance tools-k8s-worker-85 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:23:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-85 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:35:51] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:40:51] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:59:33] Toolforge (Toolforge iteration 05): [webservice] Add health probes for port 8080 - https://phabricator.wikimedia.org/T356907 (taavi) [19:59:56] Toolforge: Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (taavi) [20:00:58] (InstanceDown) firing: 
(2) Project tools instance tools-k8s-worker-72 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:05:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-72 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:10:03] Toolforge (Toolforge iteration 05): Support probes in kubernetes webservices - https://phabricator.wikimedia.org/T341919 (dcaro) a:dcaro [20:41:22] Cloud-VPS, Toolforge: define a prebaked way to temporarily disable access for a user or Tool - https://phabricator.wikimedia.org/T147242 (Andrew) The topic of this doesn't quite fit with the initial description. I /think/ that T170355 is the same ask (and it's done, and somewhat documented) but I'm conf... [20:43:29] Striker, ARM support, User-bd808: Make developer environment work on Apple Silicon - https://phabricator.wikimedia.org/T354467 (bd808) [20:44:08] Striker, ARM support, User-bd808: "Operation not supported: AH00023: Couldn't create the mpm-accept mutex" Apache2 crash under QEMU emulation - https://phabricator.wikimedia.org/T354468 (bd808) Open→Resolved [20:44:10] Striker, ARM support, User-bd808: Make developer environment work on Apple Silicon - https://phabricator.wikimedia.org/T354467 (bd808) [20:44:21] Striker, ARM support, User-bd808: Make developer environment work on Apple Silicon - https://phabricator.wikimedia.org/T354467 (bd808) In progress→Resolved [20:48:58] (InstanceDown) firing: (3) Project tools instance tools-k8s-worker-58 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:52:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:53:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-58 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:57:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:01:31] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-21.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [21:01:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-51 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:06:58] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-51 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:14:51] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:17:58] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-48 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:19:51] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:22:58] (InstanceDown) resolved: (3) Project tools instance tools-k8s-worker-48 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:30:01] (PS2) Andrew Bogott: k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 [21:30:03] (PS1) Andrew Bogott: k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 [21:31:14] (PS2) Andrew Bogott: k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 [21:31:16] (PS3) Andrew Bogott: k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 [21:33:09] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers [21:33:41] (CR) CI reject: [V: -1] k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott) [21:33:50] (CR) CI reject: [V: -1] k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott) [21:34:12] (CR) CI reject: [V: -1] k8s.kubernetes.reboot: Wait at most one minute before doing a hard reboot [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott) [21:39:26] (PS4) Andrew Bogott: k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 [21:42:43] (CR) CI reject: [V: -1] k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 (owner: Andrew Bogott) [21:44:34] (CR) Andrew Bogott: "Verified -1 seems to be something broken for this whole repo" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998556 (owner: Andrew Bogott) [21:46:15] (PS5) 
Andrew Bogott: k8s.reboot: periodically report success/failure/remaining [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/998492 [22:45:59] Grid-Engine-to-K8s-Migration, Tool-wikiloves: Migrate wikiloves from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320160 (Danilo) @JeanFred: Do you need help with this task? I don't see the code of wikiloves for years, but if it is still in python2 I can create a virt...