[00:01:11] Tool-gitlab-account-approval, User-bd808: https://gitlab.wikimedia.org/deni not being approved by bot - https://phabricator.wikimedia.org/T356350 (bd808) Grepping verbose run logs doesn't show `deni` being checked at all. My hunch is that I got the filtering wrong in [[https://gitlab.wikimedia.org/toolfo...
[00:04:50] Tool-gitlab-account-approval, User-bd808: https://gitlab.wikimedia.org/deni not being approved by bot - https://phabricator.wikimedia.org/T356350 (bd808) `"without_projects": "true"` is the problem. This new user is already in the "toolforge-repos / syncbot" project via #Striker repo creation. Should be...
[00:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:11:20] Tool-gitlab-account-approval, Patch-For-Review, User-bd808: https://gitlab.wikimedia.org/deni not being approved by bot - https://phabricator.wikimedia.org/T356350 (CodeReviewBot) bd808 opened https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/10 gitlab: include us...
[00:12:31] Tool-gitlab-account-approval, Patch-For-Review, User-bd808: https://gitlab.wikimedia.org/deni not being approved by bot - https://phabricator.wikimedia.org/T356350 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/10 gitlab: include us...
[00:17:51] Tool-gitlab-account-approval, Patch-For-Review, User-bd808: https://gitlab.wikimedia.org/deni not being approved by bot - https://phabricator.wikimedia.org/T356350 (bd808) In progress→Resolved `lang=shell-session tools.gitlab-account-approval@tools-sgebastion-11:~$ toolforge jobs logs approve...
[00:43:08] Tool-gitlab-account-approval: gitlab-account-approval bot stalled on 2024-01-09 - https://phabricator.wikimedia.org/T356097 (bd808) The "glaab.utils INFO: Checking monx94" log line comes from [[https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/blob/9f737eb2f25ab476a5aeef9d3a70b3006ae5976...
[00:43:54] Tool-gitlab-account-approval, User-bd808: gitlab-account-approval bot stalled on 2024-01-09 - https://phabricator.wikimedia.org/T356097 (bd808) p:Triage→Medium a:bd808
[01:21:32] Horizon, cloud-services-team: Horizon identity -> roles link logs user out when unauthorized - https://phabricator.wikimedia.org/T356162 (bd808)
[01:31:08] Striker, Infrastructure-Foundations, LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (bd808) Resolved→Open
[01:31:16] cloud-services-team, wikitech.wikimedia.org, Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 (bd808)
[01:31:59] Striker, Infrastructure-Foundations, LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (bd808) Reopened because #Striker needs to be made to work with the new schema and storage.
[01:37:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:42:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:45:07] (PS3) DannyS712: releases: Bump Codesniffer to 43.0.0 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/993504 (https://phabricator.wikimedia.org/T353909)
[01:45:09] (CR) DannyS712: [C: -2] "Blocked on T353909 - either LibUp needs to handle the composer plugin or we need to not try to load it" [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/993504 (https://phabricator.wikimedia.org/T353909) (owner: DannyS712)
[03:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:37:41] (CloudVPSDesignateLeaks) firing: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:42:41] (CloudVPSDesignateLeaks) firing: (2) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:19:57] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:34:57] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:35:26] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:40:11] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:47:41] (CloudVPSDesignateLeaks) firing: (2) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:52:41] (CloudVPSDesignateLeaks) resolved: (2) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:29:27] Grid-Engine-to-K8s-Migration: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro)
[09:29:51] Grid-Engine-to-K8s-Migration: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro) a:dschwen→dcaro
[09:30:10] Toolforge: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro)
[09:58:22] Grid-Engine-to-K8s-Migration, User-revi: Migrate revibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320006 (revi) I believe I've moved everything I want. It seems like revibot had a job running on gridengine, disabled it. Feel free to close if things look goo...
[10:39:37] cloud-services-team (FY2023/2024-Q3-Q4), Project-Admins: Create project tag for cloud-services-team (FY2023/2024-Q3-Q4) - https://phabricator.wikimedia.org/T356295 (Peachey88) Open→Resolved a:Peachey88 Milestone created :)
[11:13:30] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): cumin and cloud-vps instances not working - https://phabricator.wikimedia.org/T347428 (fnegri)
[11:13:55] cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review, User-aborrero: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334 (fnegri)
[11:14:17] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] Copy s51698__yetkin.wanted_items on the replica from the primary - https://phabricator.wikimedia.org/T344420 (fnegri)
[11:15:13] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero: Cloud VPS: refresh openstack resources grafana dashboard - https://phabricator.wikimedia.org/T333975 (fnegri)
[11:15:20] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (fnegri)
[11:15:28] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717 (fnegri)
[11:15:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Epic, Goal, User-aborrero: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (fnegri)
[11:15:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner - https://phabricator.wikimedia.org/T329709 (fnegri)
[11:16:00] cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338 (fnegri)
[11:16:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 (fnegri)
[11:16:09] Toolforge Build Service, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [harbor] Deploy with Helm - https://phabricator.wikimedia.org/T356301 (fnegri)
[11:16:11] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [clouddb-service-puppetmaster-2] Renew puppet CA certificates - https://phabricator.wikimedia.org/T355410 (fnegri)
[11:16:13] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): [wmcs-cookbook] increase_quota cookbook fails - https://phabricator.wikimedia.org/T352840 (fnegri)
[11:16:15] Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4): Decision Request - Incident Response Process - https://phabricator.wikimedia.org/T348887 (fnegri)
[11:16:17] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354294 (fnegri)
[11:16:19] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295 (fnegri)
[11:16:21] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [puppetmaster-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud] puppet failing to run - https://phabricator.wikimedia.org/T353048 (fnegri)
[11:16:24] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [tf-infra-tests] Failing to destroy - volumes stuck - https://phabricator.wikimedia.org/T352895 (fnegri)
[11:16:26] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (fnegri)
[11:16:28] Toolforge (Toolforge iteration 04), Toolforge Build Service, cloud-services-team (FY2023/2024-Q3-Q4): [tbs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313 (fnegri)
[11:16:30] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (fnegri)
[11:16:32] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [webservice shell] Allow a user to delete/stop all running shell pods - https://phabricator.wikimedia.org/T349733 (fnegri)
[11:16:34] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Documentation: Toolforge admin docs: revise new navigation menu and add category labels - https://phabricator.wikimedia.org/T345109 (fnegri)
[11:16:36] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] Enable disk failure prediciton - https://phabricator.wikimedia.org/T349694 (fnegri)
[11:16:38] Cloud-VPS, Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (fnegri)
[11:16:40] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [maintain-dbusers] Generate prometheus metrics - https://phabricator.wikimedia.org/T332955 (fnegri)
[11:16:42] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [wmcs-cookbooks] add a cookbook to reboot a cloudservices/cloudlb host - https://phabricator.wikimedia.org/T348841 (fnegri)
[11:16:44] Data-Services, Quarry, cloud-services-team (FY2023/2024-Q3-Q4): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (fnegri)
[11:16:48] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations: Remove wmcs-admin access from production cumin hosts - https://phabricator.wikimedia.org/T347979 (fnegri)
[11:16:52] Toolforge (Toolforge iteration 04), cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [webservice] Error shown when restarting buildpack-based tool - https://phabricator.wikimedia.org/T348312 (fnegri)
[11:16:56] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662 (fnegri)
[11:17:00] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): cloudcumin: allow wmcs-admin to run wikireplicas cookbooks and scripts - https://phabricator.wikimedia.org/T347977 (fnegri)
[11:17:08] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph,osd,puppet] getting error from facter for `ceph_disks` fact - https://phabricator.wikimedia.org/T345227 (fnegri)
[11:17:13] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): Move Cloud VPS control plane alerting to alertmanager - https://phabricator.wikimedia.org/T345294 (fnegri)
[11:17:16] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero, User-dcaro: [wmcs-cookbooks] changes to openstack cli / auth things broke several cookbooks - https://phabricator.wikimedia.org/T346427 (fnegri)
[11:17:20] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (fnegri)
[11:17:24] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), User-dcaro: OpenStack API response time gets slower over time - https://phabricator.wikimedia.org/T345084 (fnegri)
[11:17:28] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] test failover procedure - https://phabricator.wikimedia.org/T344719 (fnegri)
[11:17:32] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Patch-For-Review, User-dcaro: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885 (fnegri)
[11:17:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): WMCS cookbooks: provide shared hosts for people without global root privileges - https://phabricator.wikimedia.org/T343330 (fnegri)
[11:17:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudvps] puppetize the terraform tests VM (tf-infra-test) - https://phabricator.wikimedia.org/T341814 (fnegri)
[11:17:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudvps] use a systemd timer for the terraform tests to get logs - https://phabricator.wikimedia.org/T341769 (fnegri)
[11:17:49] cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero: Agree how to track/find all WMCS tasks that have a common topic, but belong to different projects - https://phabricator.wikimedia.org/T336681 (fnegri)
[11:17:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): Trove: tmpdir should be in external volume - https://phabricator.wikimedia.org/T336285 (fnegri)
[11:17:57] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [etcd] Find a backup solution for the etcd database - https://phabricator.wikimedia.org/T339934 (fnegri)
[11:18:01] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [helmfile] Toolforge needs helmfile >=/0.145.3, but we have 0.135.0 - https://phabricator.wikimedia.org/T339328 (fnegri)
[11:18:05] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), User-aborrero: wmcs cookbooks: automate reset nova state of a VM - https://phabricator.wikimedia.org/T336678 (fnegri)
[11:18:09] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance, User-dcaro: [horizon] Log in timing out due to nutcracker being stopped - https://phabricator.wikimedia.org/T333561 (fnegri)
[11:18:13] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro: [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network - https://phabricator.wikimedia.org/T329778 (fnegri)
[11:18:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240 (fnegri)
[11:18:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): openstack: consider removing references to old hardware from the database - https://phabricator.wikimedia.org/T335978 (fnegri)
[11:18:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820 (fnegri)
[11:18:33] Quarry, cloud-services-team (FY2023/2024-Q3-Q4), superset.wmcloud.org: Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (fnegri)
[11:18:39] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: Migrate largest ToolsDB users to Trove - https://phabricator.wikimedia.org/T291782 (fnegri)
[11:18:43] cloud-services-team (FY2023/2024-Q3-Q4), DC-Ops, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [ceph] Getting rack level HA - https://phabricator.wikimedia.org/T297083 (fnegri)
[11:18:47] Cloud-VPS, cloud-services-team, User-aborrero: [openstack] cloudservices are using different source addresses for local vs. remote updates - https://phabricator.wikimedia.org/T350995 (aborrero)
[11:19:05] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Goal: Create a community offering of OpenStack Magnum - https://phabricator.wikimedia.org/T328712 (fnegri)
[11:19:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Goal: Better support for Postgres on Trove - https://phabricator.wikimedia.org/T337396 (fnegri)
[11:20:55] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Goal, Patch-For-Review: Toolforge: Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664 (fnegri)
[11:20:59] Cloud Services Proposals, Toolforge Build Service, cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, and 4 others: [Epic] Make Toolforge a proper platform as a service with push-to-deploy and build packs - https://phabricator.wikimedia.org/T194332 (fnegri)
[11:21:07] Cloud-VPS (Debian Buster Deprecation), Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Epic, Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897 (fnegri)
[11:21:11] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [harbor] Create backups and/or replication - https://phabricator.wikimedia.org/T336668 (fnegri)
[11:21:17] Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Goal: Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687 (fnegri)
[11:21:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Goal, Puppet (Puppet 7.0): Migrate Cloud VPS puppet infrastructure to Puppet 7 - https://phabricator.wikimedia.org/T351450 (fnegri)
[11:21:49] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Goal: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 (fnegri)
[11:21:56] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal, User-Marostegui: Improve trove backup/restore - https://phabricator.wikimedia.org/T356291 (fnegri)
[11:22:11] cloud-services-team (FY2023/2024-Q3-Q4), Goal, Release-Engineering-Team (Priority Backlog 📥): Experiment with WMCS as a k8s provider for gitlab-cloud-runner cluster - https://phabricator.wikimedia.org/T353356 (fnegri)
[11:22:26] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Observability-Alerting, and 2 others: [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (fnegri)
[11:22:44] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [toolsdb] Upgrade to MariaDB 10.6 - https://phabricator.wikimedia.org/T352206 (fnegri)
[11:22:53] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717 (fnegri)
[11:22:57] Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Goal: [toolsdb] test failover procedure - https://phabricator.wikimedia.org/T344719 (fnegri)
[11:25:33] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): Create a community offering of OpenStack Magnum - https://phabricator.wikimedia.org/T328712 (fnegri)
[11:25:42] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4): Better support for Postgres on Trove - https://phabricator.wikimedia.org/T337396 (fnegri)
[11:33:22] Toolforge (Toolforge iteration 04): [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro)
[11:33:46] Cloud-VPS: Better support for Postgres on Trove - https://phabricator.wikimedia.org/T337396 (fnegri)
[11:33:58] Cloud-VPS: Create a community offering of OpenStack Magnum - https://phabricator.wikimedia.org/T328712 (fnegri)
[12:00:01] Cloud-VPS, Toolforge, cloud-services-team (FY2023/2024-Q3-Q4), Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502 (fgiunchedi) In case it is useful, as part of the icinga migration I've been collecting checks and their...
[12:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:16:16] PAWS: jupyterlab to 4.0.12 - https://phabricator.wikimedia.org/T356274 (rook) Open→Resolved
[12:16:18] PAWS: jupyterlab to 4.0.12 - https://phabricator.wikimedia.org/T356274 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/370
[12:16:29] vivian-rook closed https://github.com/toolforge/paws/pull/370
[12:20:58] Toolforge Build Service, cloud-services-team (FY2023/2024-Q3-Q4), Goal, User-aborrero: [harbor] Deploy with Helm - https://phabricator.wikimedia.org/T356301 (aborrero)
[12:56:10] Toolforge Jobs framework, User-aborrero: [jobs-api] when running a command with wrong quoting, no logs nor useful feedback is given to the user - https://phabricator.wikimedia.org/T356267 (aborrero)
[12:56:51] VPS-project-Codesearch, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Ladsgroup) Open→Resolved
[12:57:25] Toolforge (Toolforge iteration 04): [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro)
[12:58:06] Toolforge (Toolforge iteration 04), User-aborrero: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (aborrero)
[13:05:42] Toolforge (Toolforge iteration 04), cloud-services-team, Patch-For-Review: Enable ARC support in Toolforge - https://phabricator.wikimedia.org/T356171 (taavi) In progress→Resolved
[13:05:44] Cloud-VPS, Toolforge (Toolforge iteration 04), cloud-services-team: Ensure Toolforge and Cloud VPS comply with Google's new email sender guidelines - https://phabricator.wikimedia.org/T354112 (taavi)
[13:18:01] Toolforge (Toolforge iteration 04), User-aborrero: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (tstarling) See T319953#9385479 for a synopsis of panoviewer's operation. Note that zoomviewer is very similar, and migrating it...
[13:58:44] PAWS: Move prometheus inside of the cluster - https://phabricator.wikimedia.org/T355179 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/371
[13:59:18] vivian-rook opened https://github.com/toolforge/paws/pull/371
[15:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[15:18:44] cloud-services-team (FY2023/2024-Q3-Q4), Goal, Release-Engineering-Team (Priority Backlog 📥): Experiment with WMCS as a k8s provider for gitlab-cloud-runner cluster - https://phabricator.wikimedia.org/T353356 (fnegri) @dduvall we've added this task to the current #cloud-services-team goals [1]. In pr...
[15:35:26] Toolforge (Toolforge iteration 04), User-aborrero: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (dcaro) I'd say that the quicker solution right now and more stable is just to make the call yourself from the lang you are usin...
[16:02:51] Toolforge (Toolforge iteration 04), User-aborrero: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377 (bd808) Related: {T321919}
[16:46:24] (PS1) Arturo Borrero Gonzalez: Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971
[16:47:28] (CR) David Caro: [C: +1] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:47:40] (CR) David Caro: [C: +2] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:47:44] (CR) Andrew Bogott: [C: +2] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:47:46] (CR) Andrew Bogott: [V: +2 C: +2] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:47:49] (CR) FNegri: [C: +1] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:49:24] (PS2) Arturo Borrero Gonzalez: Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971
[16:49:53] (PS3) David Caro: Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:50:24] (CR) David Caro: [C: +2] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[16:50:57] (CR) David Caro: [V: +2 C: +2] Revert "aborrero: drop access" [labs/private] - https://gerrit.wikimedia.org/r/994971 (owner: Arturo Borrero Gonzalez)
[18:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[18:29:11] Tool-gitlab-account-approval, User-bd808: gitlab-account-approval bot stalled on 2024-01-09 - https://phabricator.wikimedia.org/T356097 (bd808)
[18:46:34] PAWS: Remove paws-prometheus-[12] - https://phabricator.wikimedia.org/T356429 (rook)
[19:35:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[19:40:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[19:46:47] Cloud-VPS: libup-db02 is in error state - https://phabricator.wikimedia.org/T356435 (Reedy)
[19:59:32] Horizon: User Data can be very long - https://phabricator.wikimedia.org/T356437 (Reedy)
[20:16:20] Cloud-VPS: libup-db02 is in error state - https://phabricator.wikimedia.org/T356435 (taavi) The database has filled up, which explains why it does not work: ` /dev/sdb 9.8G 9.3G 0 100% /var/lib/mysql ` But I can't tell if the resize failure is due to that or the guest agent being bad at keeping a...
[20:16:52] 10PAWS, 10Patch-For-Review: Move prometheus inside of the cluster - https://phabricator.wikimedia.org/T355179 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/371 [20:16:59] vivian-rook closed https://github.com/toolforge/paws/pull/371 [20:17:28] 10PAWS, 10Patch-For-Review: Move prometheus inside of the cluster - https://phabricator.wikimedia.org/T355179 (10rook) 05In progress→03Resolved [20:25:02] 10Cloud-VPS: libup-db02 is in error state - https://phabricator.wikimedia.org/T356435 (10Reedy) It looks like we might need a disk quota increase, so we can increase the volume on that host, and try and get it back operational? [20:38:48] (03CR) 10BryanDavis: phabricator: Offer to set issue tracker URL in toolinfo (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/992146 (owner: 10Majavah) [20:39:11] (03CR) 10BryanDavis: phabricator: Allow setting source repository project field (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/992157 (owner: 10Majavah) [20:40:42] (03PS2) 10BryanDavis: contrib: Improve setup docs a bit [labs/striker] - 10https://gerrit.wikimedia.org/r/992151 (owner: 10Majavah) [20:40:50] (03CR) 10BryanDavis: [C: 03+2] contrib: Improve setup docs a bit [labs/striker] - 10https://gerrit.wikimedia.org/r/992151 (owner: 10Majavah) [20:44:44] (03Merged) 10jenkins-bot: contrib: Improve setup docs a bit [labs/striker] - 10https://gerrit.wikimedia.org/r/992151 (owner: 10Majavah) [20:55:05] 10Cloud-VPS: Automatically install Node.js on cloud instances - https://phabricator.wikimedia.org/T356441 (10Jdlrobson) [20:59:09] 10Striker: Striker dev env fails to start with `manage.py runserver: error: unrecognized arguments: --nostatic` - https://phabricator.wikimedia.org/T355522 (10bd808) Does this somehow only happen on an initial startup/fresh install? The `--nostatic` argument is present and working as expected on my local dev env... 
[21:02:31] 10wikitech.wikimedia.org, 10LDAP: Change my username on Wikitech - https://phabricator.wikimedia.org/T355249 (10bd808) 05Resolved→03Declined [21:10:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:15:34] 10Striker, 10Infrastructure-Foundations, 10LDAP, 10User-bd808: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10bd808) a:05SLyngshede-WMF→03bd808 Claiming for the needed changes to #striker described in T148048#8802511 [21:15:44] (03PS2) 10BryanDavis: Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:16:05] (03CR) 10BryanDavis: [C: 03+2] Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:16:40] (03Merged) 10jenkins-bot: Stop building buster based images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [21:33:54] 10PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T356448 (10LibUp-bot) [21:34:04] 10PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T356453 (10LibUp-bot) [21:40:23] 10Grid-Engine-to-K8s-Migration: Migrate commons-delinquent from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319640 (10mdaniels5757) 05Open→03Resolved LGTM. [21:40:39] 10Grid-Engine-to-K8s-Migration: Migrate mdanielsbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319885 (10mdaniels5757) 05Open→03Resolved All done and seems to be working. 
[21:40:49] 10Grid-Engine-to-K8s-Migration: Migrate deletion-notification-bot-2 from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T352564 (10mdaniels5757) 05Open→03Resolved All done, seems to be working. [23:32:31] (03PS1) 10Jforrester: releases: Bump everyone to api-testing 1.6.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995115 [23:34:21] (03CR) 10Jforrester: [C: 03+2] releases: Bump everyone to api-testing 1.6.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995115 (owner: 10Jforrester) [23:34:54] (03Merged) 10jenkins-bot: releases: Bump everyone to api-testing 1.6.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995115 (owner: 10Jforrester) [23:45:50] (03PS2) 10Reedy: Also notify Toolforge on new Pywikibot releases [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/975357 (owner: 10Majavah) [23:45:55] (03CR) 10Reedy: [C: 03+2] Also notify Toolforge on new Pywikibot releases [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/975357 (owner: 10Majavah) [23:46:30] (03Merged) 10jenkins-bot: Also notify Toolforge on new Pywikibot releases [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/975357 (owner: 10Majavah) [23:53:42] (03PS1) 10Jforrester: Bump mediawiki/mediawiki-phan-config to 0.13.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995119 [23:53:48] (03CR) 10Jforrester: [C: 03+2] Bump mediawiki/mediawiki-phan-config to 0.13.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995119 (owner: 10Jforrester) [23:54:05] 10Cloud-VPS: libup-db02 is in error state - https://phabricator.wikimedia.org/T356435 (10Reedy) It's running again, but would be nice to make sure this doesn't break everything (again?) in the near future :) [23:54:22] (03Merged) 10jenkins-bot: Bump mediawiki/mediawiki-phan-config to 0.13.0 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/995119 (owner: 10Jforrester)