[00:01:19] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [00:04:18] (DiskSpace) resolved: Disk space cloudcontrol1006:9100:/ 0.6712% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:04:33] (SystemdUnitDown) resolved: (2) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:06:19] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [00:09:48] (SystemdUnitDown) firing: (3) The service unit logrotate.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:12:40] (GaleraClusterSizeMismatch) firing: (2) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [00:12:49] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [00:13:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:14:49] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:15:25] RECOVERY - Disk space on cloudcontrol1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudcontrol1006&var-datasource=eqiad+prometheus/ops [00:17:40] (GaleraClusterSizeMismatch) resolved: (2) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [00:17:50] (HAProxyBackendUnavailable) resolved: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [00:19:49] (SystemdUnitDown) firing: (4) The service unit purge_vm_backup.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:24:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:09:34] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [02:09:39] cloud-services-team: SystemdUnitDown Unit purge_vm_backup.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T352625 (phaultfinder) [03:18:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:37:13] (DiskSpace) firing: Disk space cloudcontrol1006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:42:03] Grid-Engine-to-K8s-Migration: Migrate php-security-checker from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319966 (Legoktm) Open→Resolved https://gitlab.wikimedia.org/toolforge-repos/php-security-checker/-/commit/c9a4bbd497de86f791cb1d2c05673cf76de6fb1e ` tools... [03:44:49] (SystemdUnitDown) firing: (2) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:49:48] (SystemdUnitDown) firing: (3) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:51:05] PROBLEM - Disk space on cloudcontrol1006 is CRITICAL: DISK CRITICAL - free space: / 0MiB (0% inode=97%): /tmp 0MiB (0% inode=97%): /var/tmp 0MiB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudcontrol1006&var-datasource=eqiad+prometheus/ops [03:54:49] (SystemdUnitDown) firing: (3) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:04:48] (SystemdUnitDown) firing: (3) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:14:11] Grid-Engine-to-K8s-Migration: Migrate dbreps from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319665 (Legoktm) Open→Resolved Should be done now: https://github.com/mzmcbride/database-reports/commit/e78577b7f6bd19c584d78748befaf091c5a50071 [04:16:18] Wikibugs, Phabricator, NewFunctionality-Worktype: Create conduit method to query the feed and return records with relevant details populated instead of just a bunch of phids - https://phabricator.wikimedia.org/T123417 (Aklapper) I stumbled upon rPHAB586aaa547ade5bf97fa02e2c8e11511b0387b737 which refe... 
[04:24:43] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [04:44:48] (SystemdUnitDown) firing: (4) The service unit man-db.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:54:49] (SystemdUnitDown) firing: (4) The service unit man-db.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:02:14] (DiskSpace) resolved: Disk space cloudcontrol1006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [05:03:07] PROBLEM - Host cloudcontrol1006 is DOWN: PING CRITICAL - Packet loss = 100% [05:03:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [05:03:40] (GaleraClusterSizeMismatch) firing: (2) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - 
https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [05:04:01] RECOVERY - Host cloudcontrol1006 is UP: PING OK - Packet loss = 0%, RTA = 28.01 ms [05:05:10] (SystemdUnitDown) resolved: (4) The service unit man-db.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:08:20] (HAProxyBackendUnavailable) resolved: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [05:08:40] (GaleraClusterSizeMismatch) resolved: (2) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [05:12:55] RECOVERY - Disk space on cloudcontrol1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudcontrol1006&var-datasource=eqiad+prometheus/ops [05:14:49] (SystemdUnitDown) firing: (4) The service unit man-db.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:24:49] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:09:49] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:18:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:24:43] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [08:32:13] (DiskSpace) firing: Disk space cloudcontrol1007:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:38:19] PROBLEM - Disk space on cloudcontrol1007 is CRITICAL: DISK CRITICAL - free space: / 0MiB (0% inode=97%): /tmp 0MiB (0% inode=97%): /var/tmp 0MiB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudcontrol1007&var-datasource=eqiad+prometheus/ops [08:39:49] (SystemdUnitDown) firing: (2) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:48:47] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (dcaro) p:Triage→High [08:50:04] (SystemdUnitDown) firing: (2) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:50:15] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (dcaro) Truncated the log: ` e... [08:52:13] (DiskSpace) resolved: Disk space cloudcontrol1007:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:54:23] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (dcaro) Rebooting cloudcontrol... [08:54:49] (SystemdUnitDown) resolved: (2) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:58:47] RECOVERY - Disk space on cloudcontrol1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudcontrol1007&var-datasource=eqiad+prometheus/ops [09:02:05] Toolforge (Toolforge iteration 02), Patch-For-Review: [builds-api] Use admin user credentials for Harbor API auth in dev - https://phabricator.wikimedia.org/T352022 (CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/141 builds-api: bump... [09:03:42] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [09:03:56] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [09:11:49] !log tf-infra-test dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:11:50] !log tf-infra-test dcaro@urcuchillay END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=97) [09:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tf-infra-test/SAL [09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tf-infra-test/SAL [09:15:25] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [09:15:40] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [09:19:54] !log tf-infra-test dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tf-infra-test/SAL [09:23:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:23:27] Toolforge (Toolforge iteration 02), Patch-For-Review: [builds-api] Use admin user credentials for Harbor API auth in dev - https://phabricator.wikimedia.org/T352022 (CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/141 builds-api: bump... [09:41:38] !log etytree dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Etytree/SAL [09:41:49] !log etytree dcaro@urcuchillay END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Etytree/SAL [10:09:49] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:26:02] !log tf-infra-test dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [10:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tf-infra-test/SAL [10:28:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:40:06] (PS1) Muehlenhoff: Remove ganeti RAPI dummy certs [labs/private] - https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686) [10:45:16] (CR) Muehlenhoff: [V: +2 C: +2] Remove ganeti RAPI dummy certs [labs/private] - https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff) [10:46:19] (PS1) Muehlenhoff: Remove obsolete dummy cert 
[labs/private] - https://gerrit.wikimedia.org/r/979905 [10:56:48] (CR) Elukey: [V: +2 C: +2] Remove obsolete dummy cert [labs/private] - https://gerrit.wikimedia.org/r/979905 (owner: Muehlenhoff) [11:28:42] (PS1) Klausman: hiera: clean up more ORES leftovers [labs/private] - https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) [11:30:44] (CR) Elukey: [C: +1] hiera: clean up more ORES leftovers [labs/private] - https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: Klausman) [12:24:43] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [13:49:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:09:49] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:16:11] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (Andrew) I saved a logfile fro... 
[14:22:07] Toolforge (Toolforge iteration 02): [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417 (Slst2020) note: lowering a project's quota below the amount of storage it currently uses does not break anything [14:22:17] Toolforge (Toolforge iteration 02): [maintain-harbor] Manage project quotas via maintain-harbor - https://phabricator.wikimedia.org/T352417 (Slst2020) [14:23:54] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (Andrew) Greenlet 3.0 release... [14:32:01] (CR) Klausman: [V: +2 C: +2] hiera: clean up more ORES leftovers [labs/private] - https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: Klausman) [14:35:17] Toolforge (Toolforge iteration 02): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (Slst2020) Reminder to add a confirmation prompt and a warning message that all builds will be wiped and the user will need to start a new build [15:23:35] (PS2) David Caro: ceph: add missing cumin_params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969321 [15:23:42] (PS2) David Caro: some fixes, to sort out [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/970414 [15:27:13] (CR) CI reject: [V: -1] ceph: add missing cumin_params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969321 (owner: David Caro) [15:27:21] (CR) CI reject: [V: -1] some fixes, to sort out [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/970414 (owner: David Caro) [15:34:25] Cloud-VPS, SRE, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 
(fgiunchedi) Current situation: * We have a separate `rsyslog-receiver` unit/instance with only the receiver bits on centrallog hosts * The fleet is runni... [15:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:50:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:55:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:19:59] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Support 'unmanaged' projects in cloud-vps - https://phabricator.wikimedia.org/T326818 (Andrew) Notes: 'no puppet, no ldap, no cumin' is really just 'no puppet' since puppet sets up the other things. This can be implemented by a new keystone... [16:51:29] (PS1) BryanDavis: dev: Bump GitLab container to v16.6.1 [labs/striker] - https://gerrit.wikimedia.org/r/980001 [17:13:57] Grid-Engine-to-K8s-Migration: Migrate superyetkin from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320070 (Superyetkin) I am still working on this. It may take a few weeks for me to get my scripts working with the new job engine. 
[17:32:41] (CR) BryanDavis: [C: +2] dev: Bump GitLab container to v16.6.1 [labs/striker] - https://gerrit.wikimedia.org/r/980001 (owner: BryanDavis) [17:35:50] (Merged) jenkins-bot: dev: Bump GitLab container to v16.6.1 [labs/striker] - https://gerrit.wikimedia.org/r/980001 (owner: BryanDavis) [18:14:34] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:19:15] (PS1) Jforrester: releases: Bump Vue from 3.2.37 to 3.3.9, drop compat [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/980017 (https://phabricator.wikimedia.org/T340590) [18:20:27] (CR) Jforrester: [C: +2] releases: Bump Vue from 3.2.37 to 3.3.9, drop compat [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/980017 (https://phabricator.wikimedia.org/T340590) (owner: Jforrester) [18:21:00] (Merged) jenkins-bot: releases: Bump Vue from 3.2.37 to 3.3.9, drop compat [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/980017 (https://phabricator.wikimedia.org/T340590) (owner: Jforrester) [18:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:42:17] Tool-Pageviews, Data-Engineering-Icebox: Allow users to query mediarequests using a file page link - https://phabricator.wikimedia.org/T244712 (mforns) @Dominicbm, hi! We Data Products team are reviewing this task now to see what we can do. We realized that there might be some overlap between this task's... 
[19:11:00] Quarry: Allow search within SQL - https://phabricator.wikimedia.org/T352212 (Aklapper) [20:21:21] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) [20:21:47] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) [20:24:17] (PS1) Andrew Bogott: WMF hacks: replace key and metadata panels for VM creation [openstack/horizon/horizon] (2023.1) - https://gerrit.wikimedia.org/r/980035 (https://phabricator.wikimedia.org/T326818) [20:27:47] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) [20:31:25] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) [20:32:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal, Patch-For-Review: Support 'unmanaged' projects in cloud-vps - https://phabricator.wikimedia.org/T326818 (Andrew) > Implementing puppetfree VMs can be done by having the cloud-init script skip all the puppet bits based on a metadata flag.... [21:18:17] Cloud-VPS: cannot create/update a variety of DNS records - https://phabricator.wikimedia.org/T352713 (jsn.sherman) [21:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:14:49] (SystemdUnitDown) firing: The systemd unit purge_vm_backup.service on node cloudbackup1003 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:32:03] (InstanceDown) firing: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:06:15] Toolforge (Software install/update): Please install hugin-tools and pillow again - https://phabricator.wikimedia.org/T347446 (tstarling) I filed this task because it was suggested at [[https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#Requires_a_system_library_or_tool_to_be_present|w... [23:24:05] Cloud-VPS (Quota-requests): Please delete meet and chat VPS projects - https://phabricator.wikimedia.org/T352727 (Ladsgroup) [23:28:50] Toolforge: Python virtual environment does not seem to get properly activated by a job using the new Jobs framework - https://phabricator.wikimedia.org/T309309 (Huji) Coming back to this just to memorialize how things finally got to work. ##### Step 0: set up the desired directory structure I ended up sett... 
[23:28:59] Toolforge: Python virtual environment does not seem to get properly activated by a job using the new Jobs framework - https://phabricator.wikimedia.org/T309309 (Huji) Open→Resolved a:Huji [23:29:09] Grid-Engine-to-K8s-Migration, User-Huji: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (Huji) [23:29:12] Toolforge: Python virtual environment does not seem to get properly activated by a job using the new Jobs framework - https://phabricator.wikimedia.org/T309309 (Huji) [23:32:35] cloud-services-team: Shinken is unavailable (404 - no proxy is configured) - https://phabricator.wikimedia.org/T352594 (valerio.bozzolan) Interesting OK I think we can just drop the link from the documentation. Done! 
:3 [23:34:46] Cloud-VPS (Quota-requests): Please delete meet and chat VPS projects - https://phabricator.wikimedia.org/T352727 (Aklapper) If that is done should also update https://meta.wikimedia.org/wiki/Discourse#Alternative_chat and https://meta.wikimedia.org/wiki/Wikimedia_Chat and https://meta.wikimedia.org/wiki/Wiki... [23:35:03] cloud-services-team: Shinken is unavailable (404 - no proxy is configured) - https://phabricator.wikimedia.org/T352594 (valerio.bozzolan) Open→Resolved [23:36:56] Grid-Engine-to-K8s-Migration: Migrate isprangefinder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319820 (SQL) Sorry - missed your email, and with the holidays this has slipped my mind. The offending crontab entry has been commented out. [23:39:12] Grid-Engine-to-K8s-Migration: Migrate ipcheck from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319814 (SQL) In progress→Resolved The offending crontab entries have been disabled. [23:41:41] Grid-Engine-to-K8s-Migration: Migrate isprangefinder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319820 (SQL) In progress→Resolved