[00:10:23] (03CR) 10BryanDavis: "My keystone container is also segfaulting when I try to submit a membership approval. Based on the state of my local database and ldap dir" [labs/striker] - 10https://gerrit.wikimedia.org/r/1009232 (https://phabricator.wikimedia.org/T144943) (owner: 10Majavah) [00:30:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [00:41:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:51:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:11:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:21:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:41:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project 
metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [03:11:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:21:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:30:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [03:37:15] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [05:41:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [06:01:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:06:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:30:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [07:37:16] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [08:41:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [08:50:22] (03PS3) 10Majavah: tools: Don't query OpenStack on app startup [labs/striker] - 10https://gerrit.wikimedia.org/r/1009599 [08:50:24] (03PS4) 10Majavah: Migrate dependency management to Poetry [labs/striker] - 10https://gerrit.wikimedia.org/r/1009592 [08:50:26] (03PS4) 10Majavah: Add test to make sure app boots up properly [labs/striker] - 10https://gerrit.wikimedia.org/r/1009593 [08:57:25] (03CR) 10Majavah: [C: 03+2] tools: Don't query OpenStack on app startup [labs/striker] - 10https://gerrit.wikimedia.org/r/1009599 (owner: 10Majavah) [08:58:43] (03Merged) 10jenkins-bot: tools: Don't query OpenStack on app startup [labs/striker] - 10https://gerrit.wikimedia.org/r/1009599 (owner: 10Majavah) [09:17:54] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review, 15User-aborrero: Deploy OVS test setup in codfw1dev - https://phabricator.wikimedia.org/T358761#9614357 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudvirt2001-dev.codfw.wmn... 
[09:30:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [09:39:00] 10Striker: Striker: Allow searching for access requests for a given user - https://phabricator.wikimedia.org/T282704#9614391 (10taavi) 05Open→03Resolved [09:39:02] 10Striker: Add option to only show open access requests - https://phabricator.wikimedia.org/T359338#9614392 (10taavi) 05Open→03Resolved [09:40:38] 10Striker: Strikerbot adds a welcome message to a user's talk page every time that an approved membership request is edited - https://phabricator.wikimedia.org/T323447#9614394 (10taavi) 05Open→03Resolved There is a fix live now. I'm fairly sure it won't work properly before T144943 is fixed, but that's fine... [10:08:34] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 15User-aborrero: Deploy OVS test setup in codfw1dev - https://phabricator.wikimedia.org/T358761#9614413 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudvirt2001-dev.codfw.wmnet with OS bookworm comple... 
[10:33:51] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616 (10Xqt) 03NEW [10:35:28] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [10:35:35] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [10:35:51] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [10:35:55] !log taavi@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=97) [10:36:00] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [10:36:18] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) [10:39:01] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 15User-aborrero: Deploy OVS test setup in codfw1dev - https://phabricator.wikimedia.org/T358761#9614467 (10taavi) `lang=shell-session $ sudo wmcs-openstack network create --project admin --share --provider-network-type vxlan lan-flat-cloudinstances3 $ s... 
[10:52:20] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet2007-dev is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:58:05] how cautious the recovery message [10:59:56] spoiler alert: everything is not fine [11:13:09] let me know if you want me to take a look [11:15:12] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [11:15:20] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [11:15:57] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [11:16:16] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) [11:16:20] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [11:16:27] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [11:17:22] if you have a magical fix for '2024-03-08 10:38:24.213 1331 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent ; Stdout: ; Stderr: iptables-restore v1.8.9 (nf_tables): interface name `105c0477-6f00-4b3d-8749-795a34c5f9c4' must be shorter than IFNAMSIZ (15)' I'm more than happy to take it [11:18:22] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [11:21:44] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet2008-dev is CRITICAL: CRITICAL: no netns defined? 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:23:43] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [11:24:02] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) [11:24:12] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 15User-aborrero: Deploy OVS test setup in codfw1dev - https://phabricator.wikimedia.org/T358761#9614712 (10taavi) Note: The UUID in the iptables error is the Neutron port UUID. So presumably that's not being mapped to the actual interface name somewhere... [11:27:14] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [11:27:22] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [11:27:39] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [11:27:57] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) [11:37:16] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [11:37:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:37:52] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [11:41:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [11:42:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:43:41] 05Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883#9614755 (10dcaro) >>! In T319883#9612763, @MBH wrote: > No, the problem persists: now three tools you initially added to `.sln` file works, and ~seven new, added by you to sln... [11:45:18] 06cloud-services-team, 13Patch-For-Review, 15User-dcaro, 15User-fgiunchedi: [wmcs][alerting] Allow volunteer admins silencing alerts from cloudvps/toolforge/paws/quarry - https://phabricator.wikimedia.org/T320973#9614756 (10dcaro) >>! 
In T320973#9612797, @andrea.denisse wrote: > @dcaro My apologies for the... [11:47:01] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: prepare deb packages for k8s 1.24 - https://phabricator.wikimedia.org/T359619 (10aborrero) 03NEW [11:49:27] 10PAWS: jupyterlab to 4.1.4 - https://phabricator.wikimedia.org/T359588#9614768 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/385 [11:49:41] vivian-rook closed https://github.com/toolforge/paws/pull/385 [11:49:47] 10PAWS: jupyterlab to 4.1.4 - https://phabricator.wikimedia.org/T359588#9614771 (10rook) 05Open→03Resolved [11:50:53] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616#9614773 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/387 [11:51:07] vivian-rook opened https://github.com/toolforge/paws/pull/387 [11:55:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [12:00:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [12:02:31] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [12:02:41] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [12:33:38] vivian-rook closed https://github.com/toolforge/paws/pull/387 [12:35:28] (PuppetStaleCertificates) 
firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [12:56:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:01:51] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:38:33] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: upgrade k8s etcd nodes to debian bookworm - https://phabricator.wikimedia.org/T359620#9614787 (10aborrero) p:05Triage→03Low [13:38:33] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: upgrade k8s etcd nodes to debian bookworm - https://phabricator.wikimedia.org/T359620#9614789 (10taavi) Duplicate of {T349207}? [13:38:35] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614793 (10taavi) [13:38:41] 10Toolforge (Toolforge iteration 07), 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, 15User-dcaro: [maintain-kubeusers] Allow setting the requests cpu and mem quota - https://phabricator.wikimedia.org/T357881#9614791 (10taavi) 05In progress→... 
[13:38:59] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614794 (10CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/218 maintain-kubeusers: Bump... [13:39:07] 10Toolforge (Toolforge iteration 07), 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, 15User-dcaro: [maintain-kubeusers] Allow setting the requests cpu and mem quota - https://phabricator.wikimedia.org/T357881#9614795 (10taavi) a:05dcaro→03No... [13:39:15] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614796 (10taavi) a:05dcaro→03taavi [13:39:23] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614797 (10CodeReviewBot) taavi closed https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/12 quota: allow overriding... [13:39:31] 10Toolforge (Toolforge iteration 07), 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, 15User-dcaro: [maintain-kubeusers] Allow setting the requests cpu and mem quota - https://phabricator.wikimedia.org/T357881#9614798 (10CodeReviewBot) taavi clo... [13:39:55] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614802 (10CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/218 maintain-kubeusers: Bump... [13:40:35] 10PAWS: Move nfs off of puppet? 
- https://phabricator.wikimedia.org/T359622 (10rook) 03NEW [13:40:59] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209#9614820 (10taavi) 05Open→03Resolved ` starting a run Update quota for tool wd-shex-infer from version '2-T357209-2' to version '2-T357209-3' finished... [13:41:07] 05Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140#9614822 (10taavi) [13:42:03] 10Data-Services, 06cloud-services-team: Advice needed: creating a row for every article across every language Wikipedia in ToolsDB - https://phabricator.wikimedia.org/T359564#9614840 (10taavi) [13:43:32] 10Data-Services, 06cloud-services-team: Advice needed: creating a row for every article across every language Wikipedia in ToolsDB - https://phabricator.wikimedia.org/T359564#9614854 (10taavi) Hi. Do you have any estimates on how big that table would be on disk? I'd assume we're talking in gigabytes at least h... 
[13:46:33] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616#9614920 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/387 [13:47:05] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616#9614922 (10rook) 05Open→03Resolved a:03rook [13:54:43] 10Toolforge (Toolforge iteration 07), 07Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#9615016 (10Slst2020) [13:54:59] 10Toolforge (Toolforge iteration 07): [Toolforge CLI consolidation] Explore OpenAPI SDK tooling - https://phabricator.wikimedia.org/T356261#9615015 (10Slst2020) 05Open→03In progress [13:57:04] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: upgrade k8s etcd nodes to debian bookworm - https://phabricator.wikimedia.org/T359620#9615068 (10aborrero) [13:57:12] 10Toolforge, 06cloud-services-team, 07Kubernetes: [infra] Upgrade Toolforge K8s etcd nodes to Bookworm - https://phabricator.wikimedia.org/T349207#9615070 (10aborrero) [13:57:20] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: upgrade k8s etcd nodes to debian bookworm - https://phabricator.wikimedia.org/T359620#9615072 (10aborrero) thanks [13:58:13] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615093 (10aborrero) [13:58:21] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 13Patch-For-Review, 15User-aborrero: toolforge: prepare deb packages for k8s 1.24 - https://phabricator.wikimedia.org/T359619#9615091 (10aborrero) 05Open→03In progress p:05Triage→03Medium [13:58:45] (03PS5) 10Majavah: Migrate dependency management to Poetry [labs/striker] - 10https://gerrit.wikimedia.org/r/1009592 [13:58:53] (03PS5) 10Majavah: Add test to make sure app boots up properly 
[labs/striker] - 10https://gerrit.wikimedia.org/r/1009593 [13:59:01] (03PS1) 10Majavah: tools: Fix pagination query generation warnings [labs/striker] - 10https://gerrit.wikimedia.org/r/1009753 [13:59:25] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 06DC-Ops, 06SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9615106 (10dcaro) Notes from today's (more productive) meeting: About the number in specific, they got from the har... [14:00:37] (03CR) 10Majavah: [C: 03+2] Migrate dependency management to Poetry [labs/striker] - 10https://gerrit.wikimedia.org/r/1009592 (owner: 10Majavah) [14:00:54] (03CR) 10Majavah: [C: 03+2] Add test to make sure app boots up properly [labs/striker] - 10https://gerrit.wikimedia.org/r/1009593 (owner: 10Majavah) [14:01:01] (03CR) 10Majavah: [C: 03+2] tools: Fix pagination query generation warnings [labs/striker] - 10https://gerrit.wikimedia.org/r/1009753 (owner: 10Majavah) [14:01:26] (03Merged) 10jenkins-bot: Migrate dependency management to Poetry [labs/striker] - 10https://gerrit.wikimedia.org/r/1009592 (owner: 10Majavah) [14:02:06] (03Merged) 10jenkins-bot: Add test to make sure app boots up properly [labs/striker] - 10https://gerrit.wikimedia.org/r/1009593 (owner: 10Majavah) [14:02:14] (03Merged) 10jenkins-bot: tools: Fix pagination query generation warnings [labs/striker] - 10https://gerrit.wikimedia.org/r/1009753 (owner: 10Majavah) [14:03:04] (03PS1) 10Elukey: Add Docker secret for Dragonfly cache to ML K8s staging [labs/private] - 10https://gerrit.wikimedia.org/r/1009758 (https://phabricator.wikimedia.org/T359416) [14:03:32] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add Docker secret for Dragonfly cache to ML K8s staging [labs/private] - 10https://gerrit.wikimedia.org/r/1009758 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:13:27] 10Toolforge (Toolforge iteration 07): [Toolforge CLI consolidation] Explore OpenAPI SDK tooling - 
https://phabricator.wikimedia.org/T356261#9615196 (10Slst2020) [14:37:33] 10Wikibugs, 15User-bd808: Bot does not detect when ssh connection to Gerrit is interrupted - https://phabricator.wikimedia.org/T359096#9615261 (10bd808) 05Open→03In progress a:03bd808 I'm working on an [[https://asyncssh.readthedocs.io/en/latest/|asyncssh]] implementation of the polling loop. [14:38:28] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 15User-dcaro: OpenStack API response time gets slower over time - https://phabricator.wikimedia.org/T345084#9615267 (10fnegri) The alert triggered again yesterday, this time it was caused by a spike in response time for `nova-api_backend`, that has alre... [14:39:15] (03CR) 10BryanDavis: [C: 04-2] "test asyncssh in local dev" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/1008016 (https://phabricator.wikimedia.org/T90594) (owner: 10BryanDavis) [14:41:28] (PuppetAgentFailure) firing: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [14:52:28] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [14:52:47] 10Data-Services, 06cloud-services-team: Advice needed: creating a row for every article across every language Wikipedia in ToolsDB - https://phabricator.wikimedia.org/T359564#9615294 (10fnegri) Yep, Trove sounds like a better option here. If you think the full database can fit in less than 10GB, you can also t... [14:58:05] 10Toolforge (Toolforge iteration 07): [Toolforge CLI consolidation] Explore OpenAPI SDK tooling - https://phabricator.wikimedia.org/T356261#9615301 (10Slst2020) OpenAPI Generator and Swagger Codegen look very similar on the surface. A little digging reveals that OpenAPI Generator was forked from Swagger Codegen... 
[15:16:28] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:16:40] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q3-Q4), 06DC-Ops, 06SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9615336 (10Jclark-ctr) @dcaro thanks for the notes much more productive meeting. although nothing popped out for... [15:17:28] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [15:21:58] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance metricsinfra-puppetserver-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:29:32] 10Toolforge: Tool 'tooltranslate' webservice PHP hangs forever when trying to write to its data path - https://phabricator.wikimedia.org/T336835#9615372 (10Magnus) 05Open→03Resolved a:03Magnus this has resolved itself [15:30:42] 10Toolforge Jobs framework: Kubernetes milli-CPUs are confusing - https://phabricator.wikimedia.org/T341347#9615378 (10Magnus) 05Open→03Resolved a:03Magnus [15:31:16] 14Toolforge (Toolforge iteration 06): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9615381 (10Magnus) 05Resolved→03Open Now happening repeatedly for the `mix-n-match` tool: ` toolforge build start https://github.com/magnusmanske/mixnmatch_rs/ ` [15:34:54] 14Toolforge (Toolforge iteration 06): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9615393 (10dcaro) Just checked the last build passed: ` [step-export] 2024-03-08T15:32:52.375240484Z *** Images (sha256:2068dde1e3e15eef37d5ac0d4b68b68cd6cc42a3a17d80e6c2bfaa60fc380302): [step-ex... 
[15:35:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [15:36:36] 10Tool-refill: Link to go back to main menu in new reFill GUI - https://phabricator.wikimedia.org/T359637 (10Cocobb8) 03NEW [15:36:58] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance metricsinfra-puppetserver-2 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:37:28] (InstanceDown) firing: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:39:19] 14Toolforge (Toolforge iteration 06): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9615447 (10dcaro) I do see many logs on nginx about no space left: ` root@proxy-03:~# grep mix-n-match /var/log/nginx/error.log | grep 'No space left' | wc 13 312 7363 ` [15:40:13] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: prepare deb packages for k8s 1.24 - https://phabricator.wikimedia.org/T359619#9615456 (10aborrero) 05In progress→03Resolved [15:40:17] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615457 (10aborrero) [15:42:28] (InstanceDown) resolved: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:42:47] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615471 (10aborrero) I plan to start the upgrade on toolsbeta next monday 2024-03-11. 
[15:43:22] 10Toolforge, 06cloud-services-team: Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9615476 (10aborrero) [15:44:01] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolsbeta: upgrade kubernetes to 1.24 - https://phabricator.wikimedia.org/T359638 (10aborrero) 03NEW [15:44:15] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615494 (10aborrero) [15:44:37] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolsbeta: upgrade kubernetes to 1.24 - https://phabricator.wikimedia.org/T359638#9615491 (10aborrero) 05Open→03In progress p:05Triage→03Medium [15:44:59] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615475 (10aborrero) 05Open→03In progress [15:46:10] 14Toolforge (Toolforge iteration 06): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9615497 (10Magnus) Yes, the third attempt passed. However, 500 errors should be cause for concern, even if it eventually runs through. Question: Does "no space left" refer to the mix-n-match too... 
[15:50:56] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: verify etcd version is compatible with k8s 1.24 - https://phabricator.wikimedia.org/T359639 (10aborrero) 03NEW [15:54:17] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: Upgrade Toolforge Kubernetes to version 1.24 - https://phabricator.wikimedia.org/T307651#9615544 (10aborrero) [15:55:14] 10Toolforge (Toolforge iteration 07), 06cloud-services-team, 15User-aborrero: toolforge: verify etcd version is compatible with k8s 1.24 - https://phabricator.wikimedia.org/T359639#9615542 (10aborrero) 05Open→03Resolved Per https://v1-24.docs.kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-... [15:56:40] 10Toolforge, 10cloud-services-team (FY2023/2024-Q3-Q4), 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, and 2 others: [etcd] Find a backup solution for the etcd database - https://phabricator.wikimedia.org/T339934#9615553 (10aborrero) What if we backup the data to a cinder volume. I know e... [15:59:22] 10Toolforge, 06cloud-services-team: Upgrade Toolforge Kubernetes to version 1.26 - https://phabricator.wikimedia.org/T327025#9615564 (10aborrero) [16:55:13] 14Toolforge (Toolforge iteration 06): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9615680 (10dcaro) >>! In T358552#9615497, @Magnus wrote: > Yes, the third attempt passed. However, 500 errors should be cause for concern, even if it eventually runs through. > > Question: Does... 
[16:57:51] 10Toolforge: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version - https://phabricator.wikimedia.org/T359649 (10dcaro) 03NEW p:05Triage→03High [16:59:41] 10Toolforge: [jobs-api] Store user specified command in a label or similar - https://phabricator.wikimedia.org/T359650 (10dcaro) 03NEW p:05Triage→03High [17:03:44] 10Toolforge: [jobs-api] Store user specified command in a label or similar - https://phabricator.wikimedia.org/T359650#9615747 (10aborrero) Beware, label values and similar have limitations on what characters they can store. In the past, I evaluated using something like https://www.crossplane.io/ to create a cu... [17:04:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance metricsinfra-meta-monitor-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:05:36] 10Data-Services, 06cloud-services-team: Advice needed: creating a row for every article across every language Wikipedia in ToolsDB - https://phabricator.wikimedia.org/T359564#9615750 (10Audiodude) Awesome, thanks for the feedback. I'll look into Trove. 
[17:24:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-grafana-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:24:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:29:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:32:36] (03PS4) 10Majavah: Require SUL/Phab links before applying for access [labs/striker] - 10https://gerrit.wikimedia.org/r/1008960 (https://phabricator.wikimedia.org/T172899) [17:43:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:49:28] (PuppetAgentNoResources) firing: (4) No Puppet resources found on instance metricsinfra-grafana-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:04:28] (PuppetAgentNoResources) firing: (4) No Puppet resources found on instance metricsinfra-grafana-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:09:28] (PuppetAgentNoResources) resolved: (3) No Puppet resources found on instance metricsinfra-grafana-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[18:17:07] 10Data-Services, 06cloud-services-team: Advice needed: creating a row for every article across every language Wikipedia in ToolsDB - https://phabricator.wikimedia.org/T359564#9615943 (10taavi) 05Open→03Resolved a:03taavi [18:21:26] 05Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140#9615960 (10LucasWerkmeister) It looks like the tool is still working with the new limits / requests, so I think we can call this done. Thanks everyone! [18:21:36] 05Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140#9615961 (10LucasWerkmeister) 05Open→03Resolved [18:30:16] 05Grid-Engine-to-K8s-Migration: Migrate addletterboxdfilmidbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357549#9615995 (10Carlinmack) Merge request looks promising thanks :) will test. I have an account for each tool but they are all very similar. I was actually goin... 
[18:35:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [18:35:50] (ProbeDown) firing: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:40:50] (ProbeDown) resolved: Service tools-static-14:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-14:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:53:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:00:28] (InstanceDown) firing: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:00:36] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616099 (10coldchrist) @dcaro, I made the code changes (just to GANbot.py so far) and tried running the submit command as you give it: toolforge jobs run --image tool-p... 
[19:50:28] (InstanceDown) resolved: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:08:28] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [20:33:28] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [20:37:12] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616406 (10coldchrist) @dcaro, what is much worse is that now I've reverted the code to the old version, it no longer runs. It's complaining about importlib.metadata:... [20:44:56] 10VPS-Projects, 06cloud-services-team, 10Puppet (Puppet 7.0): Migrate Puppet servers in Cloud Services team managed projects to Puppet 7 - https://phabricator.wikimedia.org/T351453#9616453 (10Andrew) [20:47:22] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616459 (10dcaro) >>! In T357554#9616099, @coldchrist wrote: > @dcaro, I made the code changes (just to GANbot.py so far) and tried running the submit command as you giv... [20:54:48] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616466 (10coldchrist) I tried that, and it's now giving me the same importlib.metadata error as the old code which I suppose at least means I have the syntax right and... 
[21:21:18] 10PAWS: Update pywikibot to 9.0.0 - https://phabricator.wikimedia.org/T359673 (10JJMC89) 03NEW [21:23:09] 10Toolforge: Update pywikibot image to 9.0.0 - https://phabricator.wikimedia.org/T359674 (10JJMC89) 03NEW [21:26:07] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616#9616521 (10taavi) [21:26:11] 10PAWS: Update pywikibot to 9.0.0 - https://phabricator.wikimedia.org/T359673#9616516 (10taavi) Dupe of {T359616}... you know the automation in LibUp for filing these tasks works again, right? [21:26:18] 10PAWS: Update pywikibot to 9.0.0 - https://phabricator.wikimedia.org/T359673#9616519 (10taavi) [21:28:51] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616526 (10dcaro) >>! In T357554#9616466, @coldchrist wrote: > I tried that, and it's now giving me the same importlib.metadata error as the old code which I suppose at... [21:30:03] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616529 (10JJMC89) >>! In T357554#9616406, @coldchrist wrote: > @dcaro, what is much worse is that now I've reverted the code to the old version, it no longer runs. It'... 
[21:34:28] (InstanceDown) firing: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:35:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on metricsinfra-puppetmaster-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [21:46:30] 10PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T359675#9616559 (10LibUp-bot) [21:46:37] 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T359676#9616561 (10LibUp-bot) [21:47:11] 05Grid-Engine-to-K8s-Migration, 06Growth-Team, 10Community-Tech (CommTech-Kanban): Migrate ERANBOT project off of Grid Engine - https://phabricator.wikimedia.org/T306888#9616566 (10MusikAnimal) Okay, bear with me as I'm not very Python fluent. I'm starting by trying out the `eswiki` job using the k8s python... 
[21:47:15] 10PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T359675#9616568 (10taavi) [21:47:36] 10PAWS: New upstream release 9.0 for Pywikibot - https://phabricator.wikimedia.org/T359616#9616570 (10taavi) [21:47:41] 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T359676#9616575 (10taavi) [21:48:49] 10Toolforge: Update pywikibot image to 9.0.0 - https://phabricator.wikimedia.org/T359674#9616573 (10taavi) [21:54:28] (InstanceDown) resolved: Project metricsinfra instance metricsinfra-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:58:27] 05Grid-Engine-to-K8s-Migration, 06Growth-Team, 10Community-Tech (CommTech-Kanban): Migrate ERANBOT project off of Grid Engine - https://phabricator.wikimedia.org/T306888#9616594 (10taavi) That sounds like Pip is trying to upgrade itself to a version that does not support Python versions this ancient. Seems... [22:07:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance project-proxy-acme-chief-02 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [22:12:28] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance maps-proxy-04 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [22:13:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:20:01] 05Grid-Engine-to-K8s-Migration: Migrate ganfilter from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T357554#9616666 (10coldchrist) And it appears from the bot's logs that it actually stopped running just a few hours before, not long after midnight Pacific time. So when I bega...
[22:21:37] 10VPS-Projects, 06cloud-services-team, 10Puppet (Puppet 7.0): Migrate Puppet servers in Cloud Services team managed projects to Puppet 7 - https://phabricator.wikimedia.org/T351453#9616667 (10Andrew) [22:22:28] (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance maps-proxy-04 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [22:23:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:45:09] 10Wikibugs, 15User-bd808: Investigate producing a code quality report for GitLab based on flake8 - https://phabricator.wikimedia.org/T359685#9616818 (10bd808) #wikibugs would be a reasonable project to experiment with this on [23:45:34] 10Wikibugs, 15User-bd808: Investigate producing a code quality report for GitLab based on flake8 - https://phabricator.wikimedia.org/T359685#9616820 (10bd808)