[00:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:25:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [00:29:00] FIRING: [8x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [00:31:52] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for JJMC89 - https://phabricator.wikimedia.org/T375041 (10JJMC89) 03NEW [00:34:50] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10155650 (10JJMC89) [00:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:14:00] FIRING: [9x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [01:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:50:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:04:00] FIRING: [10x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [02:19:00] FIRING: [11x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:27:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [02:34:00] FIRING: [12x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:47:11] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [02:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:11:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:15:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [03:16:56] FIRING: SystemdUnitDown: The service unit opentofu-infra-diff.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:20:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [03:21:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:38:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:46:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:53:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:56:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:56:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:01:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:03:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [04:08:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [04:10:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:15:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:15:39] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:15:54] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:16:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:21:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:24:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:25:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [04:28:39] (03approved) 10raymond-ndibe: [jobs-cli] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-cli] (remove_unknown_keys_in_dump) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/63 (https://phabricator.wikimedia.org/T341066) [04:28:54] (03update) 10raymond-ndibe: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066) [04:29:03] (03update) 10raymond-ndibe: [jobs-api] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/115 (https://phabricator.wikimedia.org/T341066) [04:44:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:44:11] (03update) 10raymond-ndibe: Draft: [lima-kilo] configure high-availability [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/189 (https://phabricator.wikimedia.org/T374585) [04:46:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:47:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:56:56] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:01:56] FIRING: [3x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:02:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:02:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [05:07:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [05:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:11:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:12:07] 06cloud-services-team: SystemdUnitDown Unit opentofu-infra-diff.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T375048 (10phaultfinder) 03NEW [05:13:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:16:56] RESOLVED: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:21:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:26:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:30:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [05:31:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:35:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [05:43:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:46:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [05:51:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:53:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [05:55:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:05:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:11:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:16:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:21:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:21:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:34:00] FIRING: [12x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:43:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:46:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:48:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [06:51:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:53:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [06:58:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [06:58:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown [07:16:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:30:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [07:31:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:46:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:51:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:52:51] 10Cloud Services Proposals, 06cloud-services-team, 10Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072#10155989 (10Slst2020) I'd go with Option 2 because it seems like the least labor-intensive one, and I... [08:01:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [08:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [08:09:09] !log dcaro@urcuchillay admin START - Cookbook wmcs.openstack.restart_openstack [08:09:11] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [08:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [08:09:44] (03update) 10sstefanova: builds-api: pull always for local environment [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/522 (owner: 10dcaro) [08:09:44] (03approved) 10sstefanova: builds-api: pull always for local environment [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/522 (owner: 10dcaro) [08:09:49] (03update) 10sstefanova: builds-api: pull always for local environment [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/522 (owner: 10dcaro) [08:09:49] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [08:12:09] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [08:16:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:16:59] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [08:20:43] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:20:44] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:20:45] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:20:46] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:21:44] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:21:44] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:22:47] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:23:43] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:23:47] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:24:43] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:24:44] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:24:45] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:24:47] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:25:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [08:25:43] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:26:43] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:27:16] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [08:28:37] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:29:43] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:30:43] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:31:43] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:31:44] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:33:23] PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:33:43] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:35:43] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:35:44] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:40:17] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [08:41:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [08:43:43] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:43:44] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:43:44] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:43:44] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:45:44] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:45:44] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [08:46:43] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:47] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:47] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:46:48] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:44] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:45] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:47:47] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:43] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:44] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:44] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:48:45] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:49:47] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [08:50:37] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:37] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:45] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:47] PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:47] PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:50:48] PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:43] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:44] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:45] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:46] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:47] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:48] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:49] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:50] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:51] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:52] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:53] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:54] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:55] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:56] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:57] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:58] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:51:59] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:34] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [08:52:44] PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:45] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:45] PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:52:50] PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:44] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:45] PROBLEM - nova-compute proc maximum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:45] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:50] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:50] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:51] RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:53:51] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:26] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:28] PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:44] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:50] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:54:51] RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:44] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:44] PROBLEM - nova-compute proc maximum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:45] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:45] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:46] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:47] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:48] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:49] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:50] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:51] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:52] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:53] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:55:54] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [08:56:27] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:37] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:45] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:45] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:56:56] FIRING: SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:59:40] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [08:59:45] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:59:45] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:43] RECOVERY - nova-compute proc maximum on cloudvirt1035 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:44] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:45] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:46] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:49] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:50] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [09:01:43] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:44] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:45] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:45] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:46] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:46] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:47] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:48] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:56] FIRING: [3x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:02:43] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:45] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:46] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:02:46] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:44] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:45] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:45] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:46] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:45] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:46] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:49] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:50] RECOVERY - nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:43] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:45] RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:46] RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:06:49] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:07:45] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:07:45] RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:26] RECOVERY - nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:29] RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:37] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:38] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:10:59] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [09:16:56] RESOLVED: [3x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:25:39] (03PS1) 10Slyngshede: labsauth: Allow OATH to be disabled for all users. [labs/striker] - 10https://gerrit.wikimedia.org/r/1073742 [09:50:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:51:10] 06cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T373277#10156387 (10dcaro) 05Open→03Resolved a:03dcaro [09:51:40] 06cloud-services-team, 10Toolforge: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#10156385 (10dcaro) >>! In T360488#10155063, @bd808 wrote: > I have a mostly working solution for this issue in a [[https://gitlab.wikimedia.org/toolforge-repos/bd80... [09:55:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-33 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [09:59:09] (03open) 10dcaro: change setup order [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/190 [10:00:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-33 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:05:52] (03approved) 10aborrero: change setup order [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/190 (owner: 10dcaro) [10:06:32] (03merge) 10dcaro: builds-api: pull always for local environment [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/522 [10:22:41] 10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867#10156445 (10dcaro) a:05Raymond_Ndibe→03dcaro [11:01:56] 10Toolforge (Quota-requests): Request increased quota for sqid Toolforge tool - https://phabricator.wikimedia.org/T375070 (10Mmarx) 03NEW [11:30:17] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:30:27] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:30:59] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:31:15] 10Toolforge (Quota-requests): Request increased quota for sqid Toolforge tool - https://phabricator.wikimedia.org/T375070#10156591 (10aborrero) p:05Triage→03Medium +1, LGTM Handy link for whoever in the WMCS team implement this: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Quota_management [11:33:23] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:35:02] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:35:30] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:41:33] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:42:29] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:43:00] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [11:47:28] !log aborrero@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.add_user_to_project for user 'aborrero' in role 'member' [11:47:34] !log aborrero@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'aborrero' in role 'member' [11:49:14] (03merge) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:49:28] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [11:50:56] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [11:51:17] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [11:51:52] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [11:52:50] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [11:53:42] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [11:54:11] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [11:54:52] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [11:55:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:58:16] (03open) 10aborrero: eqiad1: network: secgroups: expand coverage in cloudinfra and project-proxy [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47 (https://phabricator.wikimedia.org/T374835) [11:58:27] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47 [11:58:56] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47 [12:00:50] (03merge) 10aborrero: eqiad1: network: secgroups: expand coverage in cloudinfra and project-proxy [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/47 (https://phabricator.wikimedia.org/T374835) [12:00:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:01:13] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [12:02:22] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [12:02:30] !log aborrero@cloudcumin1001 project-proxy START - Cookbook wmcs.vps.add_user_to_project for user 'aborrero' in role 'member' [12:02:36] !log aborrero@cloudcumin1001 project-proxy END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'aborrero' in role 'member' [12:02:39] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [12:04:17] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [12:14:48] (03open) 10aborrero: cloudinfra-codfw1dev: ntp secgroup: fix service rules [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [12:14:54] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [12:15:35] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [13:19:30] (03update) 10dcaro: tekton: upgrade to v0.60.2 [repos/cloud/toolforge/builds-builder] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/61 [13:26:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:31:10] 06cloud-services-team, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Retire anycast_healthchecker Icinga check - https://phabricator.wikimedia.org/T374842#10157036 (10fgiunchedi) >>! In T374842#10153975, @ssingh wrote: > On perhaps a related note, while it is true that many of the things th... [13:34:58] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit - https://phabricator.wikimedia.org/T375079 (10Ebrahim) 03NEW [13:50:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:05:30] 06cloud-services-team: SystemdUnitDown Unit opentofu-infra-diff.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T375048#10157189 (10aborrero) 05Open→03Resolved a:03aborrero [14:07:58] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10157199 (10aborrero) p:05Triage→03Medium [14:08:06] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10157201 (10aborrero) p:05Triage→03Medium [14:08:12] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10157202 (10aborrero) p:05Triage→03Medium [14:08:24] 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 06SRE: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10157203 (10aborrero) p:05Triage→03Medium [14:08:57] 06cloud-services-team: CephClusterInUnknown - https://phabricator.wikimedia.org/T374593#10157204 (10dcaro) 05Open→03Resolved a:03dcaro This has been solved now by increasing the timeout for scrape [14:08:58] 10cloud-services-team (FY2024/2025-Q1-Q2): [cloud] Drain B row from cloud* services - https://phabricator.wikimedia.org/T374463#10157208 (10dcaro) p:05Triage→03Medium [14:09:29] 06cloud-services-team, 10SRE Observability (FY2024/2025-Q1): cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10157211 (10lmata) [14:11:08] 06cloud-services-team: PuppetFailure Puppet failure on cloudbackup2004:9100 - https://phabricator.wikimedia.org/T374158#10157218 (10dcaro) 05Open→03Resolved a:03dcaro ` dcaro@cloudbackup2004:~$ sudo run-puppet-agent Info: Using environment 'production' Info: Retrieving pluginfacts Info: Retrieving plu... [14:11:43] 06cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10157215 (10aborrero) 05Open→03Resolved a:03aborrero [14:11:45] 06cloud-services-team: PuppetFailure Puppet failure on cloudcontrol2006-dev:9100 - https://phabricator.wikimedia.org/T373815#10157222 (10aborrero) 05Open→03Resolved a:03aborrero [14:12:34] 06cloud-services-team: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157230 (10aborrero) [14:13:38] 06cloud-services-team, 06DC-Ops, 10ops-eqiad: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157238 (10aborrero) p:05Triage→03Medium [14:13:43] 06cloud-services-team, 06DC-Ops, 10ops-eqiad: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157235 (10aborrero) hey @VRiley-WMF could you please advice what should we do with the memory error in this server? [14:14:32] 06cloud-services-team: IDP/SSO logout behavior is weird - https://phabricator.wikimedia.org/T374123#10157254 (10aborrero) p:05Triage→03Low [14:14:33] 06cloud-services-team, 10Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082 (10fnegri) 03NEW [14:15:36] 10cloud-services-team (FY2024/2025-Q1-Q2): Drain C8 rack - https://phabricator.wikimedia.org/T374043#10157277 (10dcaro) p:05Triage→03Medium [14:15:53] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10157279 (10dcaro) p:05Triage→03Medium [14:16:31] 06cloud-services-team, 10Cloud-VPS: openstack: fix missing prometheus metrics - https://phabricator.wikimedia.org/T373878#10157281 (10aborrero) p:05Triage→03Low [14:17:25] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T373657#10157282 (10aborrero) 05Open→03Resolved a:03aborrero [14:17:37] 06cloud-services-team, 10Cloud-VPS: Upgrade cloudlb hosts to bookworm - https://phabricator.wikimedia.org/T375082#10157286 (10fnegri) p:05Triage→03Low [14:17:53] 06cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10157287 (10dcaro) p:05Triage→03High [14:18:55] 06cloud-services-team, 10SRE Observability (FY2024/2025-Q2): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10157290 (10joanna_borun) p:05Triage→03Medium [14:20:11] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit - https://phabricator.wikimedia.org/T375079#10157289 (10Dzahn) Isn't the core issue here that an external service github is used instead of Wikimedia infrastructure (gitlab or gerrit)? [14:20:15] 06cloud-services-team, 10Toolforge: toolforge: make sure we cache in our repos/registries all helm charts and container images used in k8s - https://phabricator.wikimedia.org/T368630#10157292 (10joanna_borun) p:05Triage→03Low [14:21:43] 06cloud-services-team: PuppetFailure Puppet failure on cloudcontrol2006-dev:9100 - https://phabricator.wikimedia.org/T373815#10157295 (10aborrero) [14:21:48] 06cloud-services-team, 10Cloud-VPS, 07Epic: tofu-infra: the cookbook should use a different git tree copy than the main one - https://phabricator.wikimedia.org/T374022#10157296 (10aborrero) [14:26:02] 06cloud-services-team, 10Data-Services, 10Toolforge, 07Performance Issue: NFS storage slow on Toolforge? - https://phabricator.wikimedia.org/T363621#10157313 (10dcaro) 05Open→03Resolved a:03dcaro I'll resolve as it seems it's not happening anymore, please create a new task linking here if you fin... [14:26:12] 10VPS-Projects, 10fundraising-tech-ops, 10Puppet (Puppet 7.0): Update puppet civicrm-prototype puppetmaster - https://phabricator.wikimedia.org/T361595#10157319 (10joanna_borun) [14:26:18] 06cloud-services-team, 10VPS-Projects, 10Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452#10157323 (10joanna_borun) p:05Triage→03Medium [14:26:22] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Use BGP to announce VM ranges from cloudnet to cloudgw - https://phabricator.wikimedia.org/T358868#10157327 (10aborrero) p:05Triage→03Low [14:26:24] 06cloud-services-team: PuppetFailure Puppet failure on cloudcontrol2006-dev:9100 - https://phabricator.wikimedia.org/T373815#10157317 (10aborrero) this was an instance of the race condition at {T374022} `lang=shell-session aborrero@cloudcontrol2006-dev:~ $ sudo run-puppet-agent Info: Using environment 'pro... [14:26:32] 06cloud-services-team: Replace or deprecate WMCS uses of report updater - https://phabricator.wikimedia.org/T357856#10157332 (10joanna_borun) p:05Triage→03Low [14:27:09] 06cloud-services-team, 10Toolforge: toolforge k8s automation: introduce option to reboot a node if the uptime is higher than XYZ - https://phabricator.wikimedia.org/T356970#10157336 (10aborrero) p:05Triage→03Medium [14:27:25] 06cloud-services-team: Replace or deprecate WMCS uses of report updater - https://phabricator.wikimedia.org/T357856#10157334 (10aborrero) 05Open→03Resolved a:03aborrero [14:27:29] 06cloud-services-team, 10Toolforge: toolforge k8s automation: introduce option to force reboot each node directly - https://phabricator.wikimedia.org/T356969#10157339 (10aborrero) p:05Triage→03Low [14:27:59] 06cloud-services-team, 10Cloud-VPS: nova-api seems to die after a while, complains of a full listen queue - https://phabricator.wikimedia.org/T354483#10157340 (10aborrero) p:05Triage→03Medium [14:30:03] 06cloud-services-team, 10Observability-Metrics: Rename prometheus/labs datasource in Grafana to prometheus/cloud - https://phabricator.wikimedia.org/T350013#10157343 (10joanna_borun) p:05Triage→03Medium [14:31:59] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Unplanned: [wmcs-cookbooks] add a cookbook to reboot a cloudservices/cloudlb host - https://phabricator.wikimedia.org/T348841#10157353 (10joanna_borun) p:05Triage→03Low [14:33:02] 06cloud-services-team, 10Toolforge, 06Toolforge-standards-committee, 06Security-Team, and 2 others: Standard process for dealing with public OAuth consumer secrets - https://phabricator.wikimedia.org/T348752#10157358 (10joanna_borun) p:05Triage→03Low [14:34:10] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 10Sustainability (Incident Followup): openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683#10157384 (10aborrero) p:05Triage→03Low [14:35:25] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Move Cloud VPS control plane alerting to alertmanager - https://phabricator.wikimedia.org/T345294#10157389 (10joanna_borun) p:05Triage→03Medium [14:36:39] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Maintenance: [cloudvps] puppetize the OpenTofu tests VM (tf-infra-test) - https://phabricator.wikimedia.org/T341814#10157390 (10joanna_borun) p:05Triage→03Low [14:37:00] 06cloud-services-team, 10Data-Services: Move replica_cnf_api out of the Puppet repo - https://phabricator.wikimedia.org/T340754#10157391 (10joanna_borun) p:05Triage→03Low [14:37:29] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance: [helmfile] Toolforge needs helmfile >=/0.145.3, but we have 0.135.0 - https://phabricator.wikimedia.org/T339328#10157393 (10joanna_borun) 05Open→03Resolved [14:38:00] 06cloud-services-team, 10Cloud-VPS, 10Observability-Alerting, 06SRE-OnFire, 10Sustainability (Incident Followup): monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694#10157385 (10dcaro) 05Open→03Resolv... [14:38:01] 06cloud-services-team, 10Toolforge: Toolforge: consider introducing some semantics for persistent storage - https://phabricator.wikimedia.org/T337192#10157396 (10joanna_borun) p:05Triage→03Medium [14:38:30] 06cloud-services-team, 10Toolforge: Toolforge: consider introducing a command line for creating reverse proxies - https://phabricator.wikimedia.org/T337191#10157399 (10aborrero) 05Open→03Declined [14:39:00] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: wmcs cookbooks: automate reset nova state of a VM - https://phabricator.wikimedia.org/T336678#10157400 (10aborrero) p:05Triage→03Low [14:40:13] 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10157416 (10aborrero) p:05Triage→03Medium [14:42:39] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10157429 (10Aklapper) [14:43:01] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10157437 (10Aklapper) 05Invalid→03Declined [14:44:07] 06cloud-services-team, 10Cloud-VPS, 07Epic: [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants - https://phabricator.wikimedia.org/T194333#10157438 (10aborrero) p:05Triage→03Medium [14:44:56] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10157434 (10Aklapper) 05Open→03Invalid Indeed [14:45:58] 06cloud-services-team, 10Cloud-VPS (Debian Buster Deprecation): Cloud-vps Buster deprecation - https://phabricator.wikimedia.org/T331738#10157453 (10joanna_borun) p:05Triage→03High [14:46:20] 06cloud-services-team: Move WMCS servers out of eqiad row B - https://phabricator.wikimedia.org/T330479#10157454 (10joanna_borun) p:05Triage→03Medium [14:47:47] 06cloud-services-team, 10Cloud-VPS: Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758#10157455 (10joanna_borun) p:05Triage→03Low [14:49:18] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:49:39] 06cloud-services-team, 10Cloud-VPS: sssd permanent failure on integration-agent-docker-1029 - https://phabricator.wikimedia.org/T324934#10157458 (10joanna_borun) p:05Triage→03Low [14:50:39] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10157459 (10jijiki) >>! In T371537#10154673, @dancy wrote: > The updated Firefox add-on is available at https://addons.mozilla.org/en-US/fir... [14:50:44] 06cloud-services-team, 10Cloud-VPS: Volumes stuck in "Reserved" state - https://phabricator.wikimedia.org/T322448#10157460 (10joanna_borun) p:05Triage→03Low [14:50:47] 06cloud-services-team, 10Cloud-VPS: Volumes stuck in "Reserved" state - https://phabricator.wikimedia.org/T322448#10157461 (10aborrero) a:03Andrew [14:51:17] 06cloud-services-team, 06SRE, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157462 (10joanna_borun) p:05Triage→03Low [14:53:00] 06cloud-services-team, 06SRE, 06Traffic, 13Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463#10157468 (10dcaro) Still some stuff to be changed: https://codesearch.wmcloud.org/search/?q=labweb [14:53:02] 06cloud-services-team, 10Cloud-VPS, 07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#10157470 (10joanna_borun) p:05Triage→03Medium [14:53:04] 06cloud-services-team, 10Cloud-VPS, 07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#10157471 (10fnegri) a:03fnegri [14:53:39] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Figure out how to install the WMF CA in Trove VMs - https://phabricator.wikimedia.org/T314888#10157479 (10fnegri) [14:53:43] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Figure out how to install the WMF CA in Trove VMs - https://phabricator.wikimedia.org/T314888#10157480 (10joanna_borun) p:05Triage→03Low [14:55:38] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Figure out how to install the WMF CA in Trove VMs - https://phabricator.wikimedia.org/T314888#10157481 (10dcaro) [14:55:44] 06cloud-services-team, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project, 07Epic, 13Patch-For-Review: Streamline WMCS Alerting and Paging - https://phabricator.wikimedia.org/T313444#10157482 (10joanna_borun) p:05Triage→03Medium [15:00:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:03:09] 06cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10157496 (10fnegri) @Ladsgroup I stumbled upon this old task, not sure if it's still relevant, if yes I need more guidance :) [15:07:22] (03CR) 10David Caro: [C:03+1] "LGTM, let me know if you want me to test it, it will take me some time though." [labs/striker] - 10https://gerrit.wikimedia.org/r/1073742 (owner: 10Slyngshede) [15:07:53] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error - https://phabricator.wikimedia.org/T373740#10157535 (10VRiley-WMF) a:03VRiley-WMF [15:12:31] (03update) 10aborrero: cloudinfra-codfw1dev: ntp secgroup: fix service rules [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [15:13:31] FIRING: [10x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:14:05] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [15:14:06] (03update) 10aborrero: cloudinfra-codfw1dev: ntp secgroup: fix service rules [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [15:14:44] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48 [15:21:35] (03open) 10aborrero: secgroups: have a common description suffix string to all resources [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:21:35] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:21:50] (03open) 10dcaro: toolforge_depoly_mr: set the latest MR as default [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/191 [15:21:50] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:21:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:23:59] (03update) 10aborrero: secgroups: have a common description suffix string to all resources [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:24:01] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:24:32] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:26:56] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:31:32] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#10157781 (10fnegri) [15:31:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:37:07] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:37:09] (03update) 10aborrero: secgroups: have a common description suffix string to all resources [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:37:43] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49 [15:41:21] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Waldir Pimenta (Waldyrious) - https://phabricator.wikimedia.org/T375110 (10waldyrious) 03NEW [15:45:05] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10157848 (10waldyrious) [15:47:35] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for JJMC89 - https://phabricator.wikimedia.org/T375041#10157856 (10bd808) I would like to sponsor @JJMC89 for this process so they can be a part of the #toolforge-standards-committee [15:47:48] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Antonin Delpeuch (Pintoch) - https://phabricator.wikimedia.org/T374995#10157861 (10bd808) I would like to sponsor @Pintoch for this process so they can be a part of the #toolforge-standards-committee [15:48:27] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001#10157866 (10bd808) I would like to sponsor @LucasWerkmeister for this process so they can be a part of the #toolforge-standards-committee if a new NDA signing is needed. [15:48:58] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Waldir Pimenta (Waldyrious) - https://phabricator.wikimedia.org/T375110#10157868 (10bd808) I would like to sponsor @Waldyrious for this process so they can be a part of the #toolforge-standards-committee [15:53:19] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for SD0001 - https://phabricator.wikimedia.org/T374998#10157877 (10bd808) I would like to sponsor @SD0001 for this process so they can be a part of the #toolforge-standards-committee [15:53:19] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for TheProtonade - https://phabricator.wikimedia.org/T375007#10157878 (10bd808) I would like to sponsor @theprotonade for this process so they can be a part of the #toolforge-standards-committee [16:07:54] 06cloud-services-team, 10Cloud-VPS: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111 (10aborrero) 03NEW [16:34:52] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 06Tech-Docs-Team, 07Documentation: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113 (10fnegri) 03NEW [16:34:56] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 06Tech-Docs-Team, 07Documentation: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113#10158098 (10fnegri) p:05Triage→03Medium [16:38:42] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10158128 (10dancy) https://chromewebstore.google.com/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb?hl=en-US&pli=1 is showing versio... [16:40:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:45:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:57:21] 06cloud-services-team, 10Cloud-VPS: openstack: clarify default security group semantics - https://phabricator.wikimedia.org/T375111#10158251 (10fnegri) I agree on declaring tofu-infra as the source of truth and I'd like if we could drop the keystone hooks and the cookbook-created rules. I see two possible sol... [17:51:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:03:27] 10Tool-quickcategories: QuickCategories background runner sometimes hangs for no apparent reason - https://phabricator.wikimedia.org/T374152#10158749 (10LucasWerkmeister) Alright, the background runner should now have a health check and hopefully restart itself if it gets stuck. [20:21:50] (03CR) 10BryanDavis: [C:03+1] "> Is the 'oath' a typo?" [labs/striker] - 10https://gerrit.wikimedia.org/r/1073742 (owner: 10Slyngshede) [20:49:19] 06Toolforge-standards-committee: Reset members and owners for toolforge-standards-committee@lists.wikimedia.org - https://phabricator.wikimedia.org/T375134 (10bd808) 03NEW [20:49:34] 06Toolforge-standards-committee: Reset members and owners for toolforge-standards-committee@lists.wikimedia.org - https://phabricator.wikimedia.org/T375134#10158886 (10bd808) p:05Triage→03Medium [21:16:47] 10VPS-project-Codesearch: Codesearch should index mw-node-qunit on Github - https://phabricator.wikimedia.org/T375079#10158981 (10Ladsgroup) FWIW, we do have a lot of github repos indexed in codesearch. Just search `gh_repo` in https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/maste... [21:39:48] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 06Tech-Docs-Team, 07Documentation: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113#10159075 (10bd808) There is some related information at https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_user_... [21:50:33] 06Toolforge-standards-committee: Reset members and owners for toolforge-standards-committee@lists.wikimedia.org - https://phabricator.wikimedia.org/T375134#10159106 (10Quiddity) [21:51:19] 06cloud-services-team, 10Toolforge: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#10159107 (10bd808) >>! In T360488#10156385, @dcaro wrote: >>>! In T360488#10155063, @bd808 wrote: >> I have a mostly working solution for this issue in a [[https://... [21:51:41] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:30:08] (03open) 10raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 [22:51:31] (03update) 10raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62