[08:07:07] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Split the API, core, and storage and runtime models - https://phabricator.wikimedia.org/T359808#10681629 (dcaro)
[08:26:44] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134 (dcaro) NEW
[08:27:55] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10681659 (dcaro) p:Triage→High
[08:40:25] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10681690 (dcaro)
[08:41:00] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T390134)
[08:41:01] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T390134)
[08:41:07] T390134: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134
[08:41:27] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T390134)
[08:52:39] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Create storage layer, and save business models in persistent storage - https://phabricator.wikimedia.org/T359650#10681718 (dcaro)
[09:05:14] Toolforge (Toolforge iteration 19): [jobs-api] Split the core layer and create the core models - https://phabricator.wikimedia.org/T390135 (dcaro) NEW
[09:12:14] cloud-services-team, Toolforge: [jobs-api] Split the `*Job` API models into three - https://phabricator.wikimedia.org/T390136 (dcaro) NEW
[09:14:17] cloud-services-team, Toolforge: [jobs-api] Introduce deprecation metrics - https://phabricator.wikimedia.org/T390137 (dcaro) NEW
[09:17:13] cloud-services-team, Toolforge: [jobs-api] Generate the openapi definition from the code - https://phabricator.wikimedia.org/T390138 (dcaro) NEW
[09:18:24] cloud-services-team, Toolforge, Patch-For-Review: [jobs-api] Refactor before webservice support - https://phabricator.wikimedia.org/T359804#10681797 (dcaro)
[09:24:07] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T390124#10681817 (Keith_D) Appears to be OK now, thanks for looking at.
[09:25:59] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T390124#10681819 (Novem_Linguae) Open→Resolved a:Novem_Linguae
[09:59:00] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19), Patch-For-Review: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10681919 (fnegri) I tested the patch https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge...
[10:09:07] PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T386408#10681978 (Pintoch) Likely caused by https://github.com/OpenRefine/OpenRefine/issues/7231
[10:09:26] PAWS, OpenRefine: [bug] Cannot import Refine project file - https://phabricator.wikimedia.org/T314553#10681980 (Pintoch) Likely caused by https://github.com/OpenRefine/OpenRefine/issues/7231.
[11:04:52] Tool-openstack-browser: openstack-browser: Show information about networks and subnets - https://phabricator.wikimedia.org/T380082#10682221 (taavi) Open→Resolved a:taavi https://openstack-browser.toolforge.org/network/
[11:59:58] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T390134)
[12:00:04] T390134: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134
[13:16:50] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[13:17:04] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:17:04] FIRING: [12x] NodeDown: Node cloudcephosd1026 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:17:04] FIRING: [14x] CloudVirtDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[13:17:04] FIRING: TooManyCloudvirtsDown: #page Reduced availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudvirtsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudvirtsDown
[13:17:04] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown
[13:17:04] FIRING: [19x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:17:04] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service radosgw-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[13:17:04] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:17:04] FIRING: [4x] InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:04] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:04] FIRING: [4x] InstanceDown: Project toolsbeta instance toolsbeta-prometheus-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:23] FIRING: InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:23] FIRING: InstanceDown: Project gitlab-runners instance runner-1028 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:23] FIRING: [11x] InstanceDown: Project tools instance tools-k8s-worker-109 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:26] FIRING: PawsNFSDown: No paws nfs services running found - https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsNFSDown
[13:18:16] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[13:18:17] FIRING: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:18:17] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of -1 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[13:18:17] FIRING: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:18:42] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:18:42] RESOLVED: [25x] NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:18:42] FIRING: InstanceDown: Project cvn instance cvn-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:18:42] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-control-12 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:18:42] FIRING: [2x] InstanceDown: Project metricsinfra instance metricsinfra-controller-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: [3x] InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[13:19:13] RESOLVED: [18x] CloudVirtDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[13:19:13] FIRING: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: [2x] InstanceDown: Project cloudinfra instance enc-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: [7x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: PawsNFSDown: No paws nfs services running found - https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsNFSDown
[13:19:13] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of -1 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[13:19:13] FIRING: InstanceDown: Project cvn instance cvn-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] FIRING: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:19:13] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown
[13:19:13] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[13:19:22] RESOLVED: [19x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:19:35] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[13:20:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 166621 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:21:01] FIRING: [2x] SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:21:09] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[13:21:24] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[13:22:07] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[13:22:08] RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown
[13:22:17] RESOLVED: [8x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:17] RESOLVED: [15x] InstanceDown: Project toolsbeta instance toolsbeta-cumin-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:17] RESOLVED: [33x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:20] RESOLVED: [3x] InstanceDown: Project gitlab-runners instance runner-1023 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:24] RESOLVED: [5x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:27] RESOLVED: InstanceDown: Project paws instance paws-nfs-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:31] RESOLVED: WidespreadInstanceDown: Widespread instances down in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:34] RESOLVED: WidespreadInstanceDown: Widespread instances down in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:38] RESOLVED: InstanceDown: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:41] RESOLVED: [2x] InstanceDown: Project cvn instance cvn-app10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:22:45] RESOLVED: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:48] RESOLVED: WidespreadInstanceDown: Widespread instances down in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:52] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:56] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:22:58] RESOLVED: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[13:22:59] RESOLVED: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[13:23:56] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:24:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:24:14] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.010 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:25:48] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 30.609 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:25:56] FIRING: [2x] SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:26:09] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[13:26:51] FIRING: RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[13:28:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:29:36] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources
[13:29:50] Tool-openstack-browser: openstack-browser: Enable to run without NFS access - https://phabricator.wikimedia.org/T390191 (taavi) NEW
[13:29:52] Tool-openstack-browser: openstack-browser: Enable to run without NFS access - https://phabricator.wikimedia.org/T390193 (taavi) NEW
[13:30:30] Tool-openstack-browser: openstack-browser: Enable to run without NFS access - https://phabricator.wikimedia.org/T390191#10683016 (taavi)
[13:30:38] Tool-openstack-browser: openstack-browser: Enable to run without NFS access - https://phabricator.wikimedia.org/T390193#10683018 (taavi) →Duplicate dup:T390191
[13:31:51] FIRING: [3x] RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[13:32:04] cloud-services-team: RabbitmqNetworkPartition A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://phabricator.wikimedia.org/T390190#10683024 (phaultfinder)
[13:32:52] RESOLVED: [2x] HAProxyServiceUnavailable: HAProxy service radosgw-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[13:33:20] RESOLVED: TooManyCloudvirtsDown: #page Reduced availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudvirtsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudvirtsDown
[13:33:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:34:36] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources
[13:34:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:36:40] RESOLVED: PawsNFSDown: No paws nfs services running found - https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsNFSDown
[13:39:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:41:51] RESOLVED: [3x] RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[13:41:56] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:42:37] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources
[13:42:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[13:45:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:46:36] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:46:38] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:46:50] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:47:37] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources
[13:47:50] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:48:06] (CR) Thiemo Kreuz (WMDE): [C:+1] Add function documentation [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1119843 (owner: Umherirrender)
[13:48:50] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:49:12] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:49:14] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:49:32] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:10] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:18] FIRING: [2x] KernelErrors: Server cloudcephmon1006 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephmon1006 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:50:23] FIRING: OOM: OOM killer active on cloudcephmon1006:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:50:26] cloud-services-team: KernelErrors Server cloudcephmon1006 logged kernel errors - https://phabricator.wikimedia.org/T390198 (phaultfinder) NEW
[13:50:32] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:36] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:50] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:54] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:51:12] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:51:22] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:51:50] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:51:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:51:58] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:52:10] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:52:45] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services
[13:52:58] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[13:53:12] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:14] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:50] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:58] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:54:22] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:36] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:36] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:37] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:40] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:44] PROBLEM - nova-compute proc maximum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:50] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:50] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:54] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:55:14] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:14] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:16] PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:23] RESOLVED: OOM: OOM killer active on cloudcephmon1006:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:55:32] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:36] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:36] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:44] RECOVERY - nova-compute proc maximum on cloudvirt1038 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:58] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:12] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:16] RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:32] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:36] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:59:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:03:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[14:06:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:09:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:10:34] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:34] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:36] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:36] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:37] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:37] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:37] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:37] PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:42] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:58] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:58] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:58] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:00] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:02] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:06] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:08] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:10] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:12] PROBLEM - nova-compute proc minimum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:14] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:16] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:32] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:34] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 2342 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:13:00] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:32] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:36] PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:36] PROBLEM - nova-compute proc maximum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:37] PROBLEM - nova-compute proc maximum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:40] PROBLEM - nova-compute proc maximum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:40] PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:42] PROBLEM - nova-compute proc maximum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:52] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:52] PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:52] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:53] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:54] PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:02] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:10] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:12] PROBLEM - nova-compute proc maximum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:12] PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:16] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:18] PROBLEM - nova-compute proc maximum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:22] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:36] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:54] RECOVERY - nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:56] FIRING: [2x] SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:18:36] RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:19:02] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:19:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:20:04] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:10] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:12] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:12] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:13] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:14] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:18] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:20] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:32] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:32] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:36] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:36] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:36] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:40] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:41] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:50] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:50] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:54] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:58] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:21:12] RECOVERY - nova-compute proc maximum on cloudvirt1057 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:21:36] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:22:12] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:22:20] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:22:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[14:22:52] RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:22:56] FIRING: [4x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:23:10] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:23:40] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:23:42] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:23:52] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[14:24:04] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:12] RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:14] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:16] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:16] PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:22] PROBLEM - nova-compute proc maximum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:22] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:36] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:36] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:37] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:37] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:37] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:40] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:44] PROBLEM - nova-compute proc maximum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:50] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:50] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:58] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:12] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:22] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:36] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:44] RECOVERY - nova-compute proc maximum on cloudvirt1038 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:54] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:25:58] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:26:12] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].*
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:14] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:14] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:16] RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:32] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:32] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:36] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:37] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:37] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:26:52] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:27:16] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:27:34] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:27:36] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:27:56] FIRING: [4x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:28:08] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:28:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:29:08] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:29:36] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:29:40] RECOVERY - nova-compute proc maximum on cloudvirt1032 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:29:52] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:29:58] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:30:02] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:30:08] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:30:40] RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:30:58] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:31:36] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:31:36] RECOVERY - nova-compute proc maximum on cloudvirt1067 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:31:42] RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:32:06] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:32:36] RECOVERY - nova-compute proc maximum on cloudvirt1065 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:32:36] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:33:22] FIRING: HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:33:46] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-72 [14:33:47] !log taavi@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-72 [14:33:51] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-72 [14:33:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:34:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [14:34:12] RECOVERY - nova-compute proc minimum on cloudvirt1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:14] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:18] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:18] RECOVERY - nova-compute proc maximum on cloudvirt1066 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:22] RECOVERY - nova-compute proc maximum on cloudvirt1063 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:22] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:34:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services [14:38:22] FIRING: [6x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:39:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [14:41:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-21 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:42:56] FIRING: [3x] SystemdUnitDown: The service unit designate-producer.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:43:22] RESOLVED: [5x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:47:56] RESOLVED: [3x] SystemdUnitDown: The service unit designate-producer.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:52:26] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker role in the tools cluster [14:59:56] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-21, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-72 [15:02:48] !log taavi@cloudcumin1001 tools Added a new k8s worker tools-k8s-worker-111.tools.eqiad1.wikimedia.cloud to the cluster [15:02:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker role in the tools cluster [15:24:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [15:29:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [15:29:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:34:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [15:34:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:39:28] FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:44:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [15:44:28] FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:49:28] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure 
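The flapping nova-compute checks earlier in this log are simple process counts: "proc minimum" goes CRITICAL when no process matches the regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute, and "proc maximum" additionally requires the match to have PPID = 1; in both cases the CRITICALs above show a count of 0. The bracketed [n] is the usual self-exclusion trick: the literal text "pytho[n]" in the checker's own command line does not match the pattern, so the check never counts itself. A minimal sketch of the same count, assuming psutil is available (the production check is a Nagios-style check_procs, not this code):

import re
import psutil

# Bracketing one letter keeps the pattern from matching the literal
# pattern text in this checker's own command-line arguments.
PATTERN = re.compile(r"^/usr/bin/pytho[n].* /usr/bin/nova-compute")

def count_nova_compute(require_init_parent: bool) -> int:
    matches = 0
    for proc in psutil.process_iter(["ppid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if not PATTERN.match(cmdline):
            continue
        # The "proc maximum" variant also insists the daemon is
        # parented directly to init (PPID = 1).
        if require_init_parent and proc.info["ppid"] != 1:
            continue
        matches += 1
    return matches

if __name__ == "__main__":
    # Both alerts above report CRITICAL when this drops to 0.
    print("nova-compute processes:", count_nova_compute(True))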
[15:51:23] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210 (MBH) NEW [15:53:43] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers [15:53:56] !log aborrero@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for all NFS workers [15:59:00] !log root@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers [15:59:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:00:22] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.011 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:00:28] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 4.703 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [16:04:28] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:04:56] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:07:54] Cloud-VPS (Quota-requests): Add new flavor for dwl project and increase quota - https://phabricator.wikimedia.org/T389711#10683729 (Andrew) +1 [16:10:06] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210#10683736 (MBH) On `mbh` tool another one-time job can't be started with this rationale: `ERROR: An internal error occurred while executing this command. Traceback (most recent call las...
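The toolschecker probe above is a plain HTTP content check against http://checker.tools.wmflabs.org:80/nfs/home, an endpoint that exercises NFS on the server side: it goes CRITICAL unless the response is an HTTP 200 whose body contains the string OK, and the 504 after 60.011 s shows the roughly 60-second budget the 16:00:22 PROBLEM ran into. A rough equivalent, assuming the requests library (the production probe is an HTTP checker, not this script):

import requests

URL = "http://checker.tools.wmflabs.org:80/nfs/home"

def check_nfs_home() -> str:
    try:
        # The PROBLEM above timed out at 60.011 s, so budget ~60 s here.
        resp = requests.get(URL, timeout=60)
    except requests.RequestException as exc:
        return f"CRITICAL: {exc}"
    if resp.status_code != 200 or "OK" not in resp.text:
        return f"CRITICAL: string OK not found (HTTP {resp.status_code})"
    return f"OK: HTTP 200 in {resp.elapsed.total_seconds():.3f} seconds"

if __name__ == "__main__":
    print(check_nfs_home())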
[16:14:28] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:19:28] FIRING: [5x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:29:28] RESOLVED: [3x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-21 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:32:52] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Upgrade "toolsbeta" cluster to k8s 1.29 - https://phabricator.wikimedia.org/T390212 (fnegri) NEW [16:34:07] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Upgrade "tools" cluster to k8s 1.29 - https://phabricator.wikimedia.org/T390214 (fnegri) NEW [16:34:20] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Upgrade "tools" cluster to k8s 1.29 - https://phabricator.wikimedia.org/T390214#10683828 (fnegri) [16:35:40] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19), Patch-For-Review: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10683830 (fnegri) [16:36:04] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19), Patch-For-Review: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10683834 (fnegri) [16:44:09] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683889 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2db5921e-9fd3-4768-9222-3e33bdad8325) set by... [16:45:12] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683907 (dcaro) [16:46:12] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683911 (dcaro) @VRiley-WMF hi! cloudcephosd1029 is ready to get one disk replaced by the dell new one :) It's turned...
[16:52:01] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Upgrade "toolsbeta" cluster to k8s 1.29 - https://phabricator.wikimedia.org/T390212#10683926 (fnegri) [16:52:54] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Upgrade "tools" cluster to k8s 1.29 - https://phabricator.wikimedia.org/T390214#10683930 (fnegri) [16:54:56] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:55:48] (update) raymond-ndibe: [jobs-api] create separate api.py and move flask things there [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [16:58:23] (update) raymond-ndibe: [jobs-api] move core logic to separate core module [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [17:02:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:03:42] (update) raymond-ndibe: [jobs-api] move core logic to separate core module [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [17:10:55] (update) raymond-ndibe: [jobs-api] move core logic to separate core module [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [17:12:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:17:03] RESOLVED: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [17:17:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:22:18] FIRING: [3x]
ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:26:02] !log root@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for all NFS workers [17:26:32] !log root@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-40, tools-k8s-worker-nfs-33 [17:27:18] RESOLVED: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [17:34:55] !log root@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-40, tools-k8s-worker-nfs-33 [17:35:24] Tool-global-search: Add all CirrusSearch filters to Global Search - https://phabricator.wikimedia.org/T344371#10684113 (EBernhardson) As a completely hacky way to implement this, you could make cirrus do the query building. This wouldn't really be a supported way of doing things, it's utilizing cirrus debug... [17:37:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-40 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:37:59] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210#10684116 (bd808) There was a [[https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/LX6KDZMQHEL3NZ3DMWQERI2O3YVSDDKM/|full network outage earlier today]] in Clo... [17:41:25] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210#10684120 (bd808) >>! In T390210#10683736, @MBH wrote: > On `mbh` tool another one-time job can't be started with this rationale: Was that the `ip-map` one-off job that is running now... [17:42:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-40 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:44:25] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210#10684123 (dcaro) I see all the jobs running now, probably the network outage then?
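The ToolforgeKubernetesWorkerTooManyDProcesses alerts threaded through this log fire when a worker has at least 12 processes in D state (uninterruptible sleep), which on the NFS workers usually means threads blocked on NFS I/O; that is why the response above is to reboot the affected workers. The state is readable straight from /proc; a minimal sketch of the count (illustrative only; the real alert is computed from exported node metrics):

from pathlib import Path

def count_d_state() -> int:
    stuck = 0
    for stat in Path("/proc").glob("[0-9]*/stat"):
        try:
            # /proc/<pid>/stat is "pid (comm) state ..."; split after
            # the last ")" because comm may itself contain spaces.
            state = stat.read_text().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # the process exited while we were reading
        if state == "D":
            stuck += 1
    return stuck

if __name__ == "__main__":
    # The alert fires once a worker reaches 12 of these.
    print(count_d_state(), "processes in D state")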
[17:52:21] (open) dcaro: maintain-harbor: increase memory quota for mbh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/728 (https://phabricator.wikimedia.org/T389733) [17:56:06] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684200 (dcaro) >>! In T389733#10678994, @MBH wrote: > @dcaro Let's start from 8 or, better, 12 GB, if you could set 12. I don't want to cause you any unnecessary inconvenience and w... [17:57:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-33 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [18:02:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-33 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [18:06:39] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684245 (MBH) >it might become really hard for your jobs to find a worker to run on I know almost nothing about k8s, so I didn't think such a problem was possible (i don't even know... [18:22:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-40 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:22:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-40 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:23:03] (update) raymond-ndibe: [jobs-api] move core logic to separate core module [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/91 (https://phabricator.wikimedia.org/T359804) [18:27:18] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-40 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:42:47] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684333 (dcaro) >>!
In T389733#10684245, @MBH wrote: >>it might become really hard for your jobs to find a worker to run on > I know almost nothing about k8s, so I didn't think such... [18:47:04] cloud-services-team, Toolforge: [toolforge] increase worker sizes in tools - https://phabricator.wikimedia.org/T390228 (dcaro) NEW [18:47:59] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684350 (dcaro) Created {T390228} to increase worker sizes :) In the meantime, once the quota is increased, you can try using a smaller limit if you see that your job is not getting... [18:48:20] cloud-services-team, Toolforge: [toolforge] increase worker sizes in tools - https://phabricator.wikimedia.org/T390228#10684353 (dcaro) See https://phabricator.wikimedia.org/T389733#10684333 for example. [19:01:57] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684411 (MBH) I thought, limits works not like > you can run a job only when it's possible to allocate you 12 GB right now but > you can run a job in any time, if your job will reque... [19:05:57] Toolforge (Quota-requests), Patch-For-Review: Increase RAM quota for mbh tool - https://phabricator.wikimedia.org/T389733#10684450 (MBH) Another question: in last month many of my jobs disappeared from my processes (and wasn't successfully completed) without any output to `.err` files. Is it how process... [19:36:56] FIRING: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:46:56] RESOLVED: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:02:54] cloud-services-team, Toolforge: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210#10684670 (MBH) Open→Resolved a:MBH Yes, looks like everything works now. [20:25:03] Tool-toolwatch: Investigate and Resolve QS-Dev Tools Unavailability Issue - https://phabricator.wikimedia.org/T389967#10684728 (Arcstur) Open→Resolved a:Arcstur Fixed by https://github.com/wikimediabrasil/quickstatements3/pull/257. [23:14:24] Tool-schedule-deployment, Deployments, Jouncebot, Release-Engineering-Team: Consider using JSON content model for deployment calendar - https://phabricator.wikimedia.org/T366880#10685197 (bd808)
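On the limits question in T389733 above: in Kubernetes terms a memory request is what the scheduler must find free on some worker before a job can start (dcaro's point about large jobs struggling to find a worker to run on), while a limit is a runtime ceiling whose breach gets the container OOM-killed, usually with no chance to write anything to .err, which would also fit MBH's report of jobs disappearing without output. A sketch of the two fields with illustrative values only (how the Toolforge jobs framework maps its memory option onto them is not shown in this log):

# Hypothetical container resources stanza, for illustration.
container_resources = {
    # Scheduling-time: a worker must have this much memory unreserved
    # before the job is placed; too large and the job stays Pending.
    "requests": {"memory": "12Gi"},
    # Run-time: exceeding this gets the process OOM-killed on the spot.
    "limits": {"memory": "12Gi"},
}

for kind, spec in container_resources.items():
    print(f"{kind}: memory {spec['memory']}")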