[00:02:47] 10VPS-project-Wikistats: Add bewwiktionary to wikistats - https://phabricator.wikimedia.org/T402139#11104943 (10Dzahn) 05Open→03Resolved a:03Dzahn ` MariaDB [wikistats]> insert into wiktionaries (prefix, lang, loclang, method) select prefix,lang,loclang,method from wikipedias where prefix="bew"; Query...
[02:15:18] FIRING: [2x] KernelErrors: Server cloudcephosd1048 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[02:15:27] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475 (10phaultfinder) 03NEW
[02:20:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[02:21:56] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11105108 (10Andrew)
[02:24:04] PROBLEM - Host cloudcephosd1042 is DOWN: PING CRITICAL - Packet loss = 100%
[02:26:32] RECOVERY - Host cloudcephosd1042 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms
[02:27:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[02:27:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T401693)
[02:32:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[02:58:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-80 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:03:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:03:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[03:06:46] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE, 13Patch-For-Review: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11105131 (10Andrew)
[03:07:04] PROBLEM - Host cloudcephosd1042 is DOWN: PING CRITICAL - Packet loss = 100%
[03:08:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
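
The wikistats change at 00:02:47 above is a straightforward INSERT...SELECT; as a sketch of the full exchange (the follow-up SELECT is illustrative and not from the log, and assumes the wiktionaries table shares the wikipedias column layout, which the INSERT implies):

    -- copy the bew row from the wikipedias table into wiktionaries (from the log above)
    MariaDB [wikistats]> insert into wiktionaries (prefix, lang, loclang, method)
        -> select prefix, lang, loclang, method from wikipedias where prefix="bew";
    -- hypothetical check that the row landed:
    MariaDB [wikistats]> select prefix, lang, loclang, method from wiktionaries where prefix="bew";
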
[03:09:32] RECOVERY - Host cloudcephosd1042 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[03:10:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[03:23:50] FIRING: MaxConntrack: Max conntrack at 100% on cloudcephosd1042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[03:23:50] FIRING: CephSlowOps: Ceph cluster in eqiad has 1258 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:23:50] FIRING: [2x] ProbeDown: Service toolsbeta-proxy-8:443 has failed probes (http_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-proxy-8:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:23:50] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-54 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:50] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T401693)
[03:23:50] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[03:23:50] FIRING: [2x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:50] FIRING: [2x] InstanceDown: Project cloudinfra instance cloudinfra-idp-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:50] FIRING: [3x] InstanceDown: Project gitlab-runners instance runner-1031 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:50] FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:23:50] FIRING: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:23:50] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:23:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T401693)
[03:23:50] FIRING: TargetDown: Job frontproxy-nginx is unreachable in project toolsbeta instance toolsbeta-proxy-8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:23:50] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T401693)
[03:23:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T401693)
[03:23:50] RESOLVED: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
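
The ToolforgeKubernetesWorkerTooManyDProcesses alerts above fire when a node has at least 12 processes in uninterruptible sleep (state D), which usually means they are stuck on NFS I/O. A quick manual check on an affected worker could look like this (a sketch using standard ps/awk, not commands from the log):

    # count processes stuck in uninterruptible sleep (state D) on this node
    ps -eo state,pid,comm | awk '$1 == "D"' | wc -l
    # show what they are blocked on (wchan) to confirm NFS involvement
    ps -eo state,pid,wchan:32,comm | awk '$1 == "D"'
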
[03:23:50] FIRING: [6x] SystemdUnitDown: The service unit ceph-osd@64.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:23:50] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T401693)
[03:23:59] FIRING: TargetDown: Job prometheus is unreachable in project metricsinfra instance metricsinfra-prometheus-3 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:23:59] FIRING: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:24:03] RESOLVED: [3x] InstanceDown: Project gitlab-runners instance runner-1031 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:06] RESOLVED: [5x] InstanceDown: Project toolsbeta instance toolsbeta-acme-chief-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:10] RESOLVED: WidespreadInstanceDown: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:24:14] RESOLVED: TargetDown: Job frontproxy-nginx is unreachable in project toolsbeta instance toolsbeta-proxy-8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:24:19] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:24:32] RESOLVED: [20x] InstanceDown: Project tools instance tools-acme-chief-4 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:35] FIRING: InstanceDown: Project cvn instance cvn-app14 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:39] FIRING: [6x] InstanceDown: Project gitlab-runners instance runner-1031 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:42] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-grafana-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:46] RESOLVED: InstanceDown: Project cvn instance cvn-app14 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:49] FIRING: [66x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:24:56] FIRING: SystemdUnitDown: The service unit ceph-osd@65.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:25:01] FIRING: [4x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:08] FIRING: [5x] InstanceDown: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:11] FIRING: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:25:15] FIRING: [2x] TargetDown: Job alertmanager is unreachable in project metricsinfra instance metricsinfra-alertmanager-2 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:25:19] FIRING: [4x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:22] FIRING: InstanceDown: Project paws instance bastion is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:28] RESOLVED: InstanceDown: Project paws instance bastion is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:35] RESOLVED: WidespreadInstanceDown: Widespread instances down in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:25:58] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:27:43] RESOLVED: [6x] InstanceDown: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:27:43] RESOLVED: [9x] InstanceDown: Project gitlab-runners instance gitlab-runners-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:28:49] RESOLVED: [4x] InstanceDown: Project cloudinfra instance cloudinfra-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:28:49] RESOLVED: [66x] InstanceDown: Project tools instance tools-acme-chief-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:28:55] RESOLVED: [3x] TargetDown: Job alertmanager is unreachable in project metricsinfra instance metricsinfra-alertmanager-2 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:28:58] RESOLVED: [4x] InstanceDown: Project metricsinfra instance metricsinfra-alertmanager-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:29:56] RESOLVED: SystemdUnitDown: The service unit ceph-osd@65.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:30:04] PROBLEM - Host cloudcephosd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:30:32] RECOVERY - Host cloudcephosd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[03:32:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T401693)
[03:32:42] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T401693)
[03:33:23] RESOLVED: [2x] ProbeDown: Service toolsbeta-proxy-8:443 has failed probes (http_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-proxy-8:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:34:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 12605 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[03:35:58] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[03:36:25] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[03:36:58] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:39:41] FIRING: [7x] SystemdUnitDown: The service unit ceph-osd@64.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:46:26] RESOLVED: [2x] SystemdUnitDown: The service unit ceph-osd@64.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:46:30] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24, tools-k8s-worker-nfs-11, tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-78, tools-k8s-worker-nfs-80
[04:03:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:07:48] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24, tools-k8s-worker-nfs-11, tools-k8s-worker-nfs-12, tools-k8s-worker-nfs-78, tools-k8s-worker-nfs-80
[04:23:49] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:28:49] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[04:33:49] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[04:39:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:43:49] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[04:44:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-11 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:48:49] FIRING: [10x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProces
[04:58:49] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[04:59:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-11 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[05:04:28] RESOLVED: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-11 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[05:08:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[05:34:56] FIRING: [5x] SystemdUnitDown: The systemd unit ceph-osd@65.service on node cloudcephosd1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:43:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:33:49] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:58:49] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:03:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:18:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:25:28] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki rkiwiki - https://phabricator.wikimedia.org/T392502#11105402 (10taavi) a:03taavi
[07:27:59] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki zghwiktionary - https://phabricator.wikimedia.org/T399788#11105413 (10taavi) a:03taavi
[07:28:22] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers
[07:28:40] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki minwikibooks - https://phabricator.wikimedia.org/T395502#11105416 (10taavi) a:03taavi
[07:29:15] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki madwikisource - https://phabricator.wikimedia.org/T391770#11105419 (10taavi) a:03taavi
[07:29:56] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki tlwikisource - https://phabricator.wikimedia.org/T388657#11105422 (10taavi) a:05fnegri→03taavi
[07:38:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:00:18] (03update) 10dcaro: test_runtime: refactor a bit [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/201
[08:02:55] (03merge) 10dcaro: test_runtime: refactor a bit [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/201
[08:03:28] (03update) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.406-20250821080027-34d0f3fd [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/933
[08:05:24] (03update) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.407-20250821080315-aa0d79f8 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/933
[08:11:21] 06cloud-services-team, 10Data-Services: [wikireplicas] Create views for new wiki tlwikisource - https://phabricator.wikimedia.org/T388657#11105523 (10taavi) 05Open→03Resolved
[08:15:29] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe)
[08:33:49] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:47:31] !log filippo@cloudcumin1001 collection-alt-renderer START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:50:02] 06cloud-services-team, 10Cloud-VPS: Audit and potentially fix VMs not reachable by cloudcumin root key - https://phabricator.wikimedia.org/T402185#11105589 (10fgiunchedi) mediawiki2latex.collection-alt-renderer.eqiad1.wikimedia.cloud is 172.16.2.213 in dns but 172.16.0.22 is configured on the host, no wonder i...
[08:50:37] !log filippo@cloudcumin1001 collection-alt-renderer END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:50:50] !log filippo@cloudcumin1001 globaleducation START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:51:26] RESOLVED: [4x] SystemdUnitDown: The systemd unit ceph-osd@66.service on node cloudcephosd1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:51:37] !log filippo@cloudcumin1001 globaleducation END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:51:44] 06cloud-services-team, 10Cloud-VPS: Audit and potentially fix VMs not reachable by cloudcumin root key - https://phabricator.wikimedia.org/T402185#11105592 (10fgiunchedi) data-rearchitecture-project-test.globaleducation.eqiad1.wikimedia.cloud has no ip configured on the vm
[08:51:49] !log filippo@cloudcumin1001 wmf-research-tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:52:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-76 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:52:36] 06cloud-services-team, 10Cloud-VPS: Audit and potentially fix VMs not reachable by cloudcumin root key - https://phabricator.wikimedia.org/T402185#11105607 (10fgiunchedi) reader-embedding.wmf-research-tools.eqiad1.wikimedia.cloud has no ip configured on the vm
[08:54:28] 06cloud-services-team, 10Cloud-VPS: Audit and potentially fix VMs not reachable by cloudcumin root key - https://phabricator.wikimedia.org/T402185#11105610 (10fgiunchedi)
[08:54:37] !log filippo@cloudcumin1001 wmf-research-tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:54:41] !log filippo@cloudcumin1001 o11y END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:58:48] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105643 (10dcaro)
[09:02:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-75 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:02:38] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105652 (10dcaro)
[09:05:46] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11105663 (10fnegri)
[09:05:49] 06cloud-services-team, 06DC-Ops, 06SRE, 13Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11105664 (10fnegri)
[09:07:28] FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-74 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:08:43] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11105683 (10fnegri) Something is wrong on those hosts with the LVM setup: `pvs` returns an empty output. Not sure if the LVM setup is supposed to happen during the reimage or after. For comparison, on an older OSD host: ` fnegri@c...
[09:09:13] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11105684 (10fnegri) p:05Triage→03Medium a:03Andrew
[09:11:39] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105688 (10dcaro) Looking at the grafana dashboards, noticed that there's a relatively high loss of jumbo frames: {F65811385} It's not new though, but it's worth l...
[09:12:28] FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-73 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:13:49] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:16:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-58 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:17:28] FIRING: [4x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-73 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:19:06] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:21:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-58 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:22:28] RESOLVED: [4x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-73 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:23:31] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105737 (10dcaro) The loss of the pings towards 1004, and the current source for lost pings are cloudcephosd1043/44/47, they are not yet in the cluster so that shou...
[09:26:52] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105753 (10dcaro)
[09:38:31] 06cloud-services-team, 10Toolforge: [Build service] latest builder has old PHP - https://phabricator.wikimedia.org/T401875#11105778 (10fnegri) > If you put 3.13.7 in .python-version then the build will report Using Python version 3.13.7 specified in .python-version and you get the latest version. You're right...
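
On the jumbo-frame loss dcaro notes at 09:11:39: a common way to test whether 9000-byte frames survive a path is a don't-fragment ping sized for the MTU (a sketch; the target host is an example, and 8972 = 9000 minus the 20-byte IPv4 header and 8-byte ICMP header):

    # probe with the DF bit set and a full jumbo frame
    ping -M do -s 8972 -c 10 cloudcephosd1004.eqiad.wmnet
    # compare against a default-size ping; loss only in the first case points at jumbo frames being dropped
    ping -c 10 cloudcephosd1004.eqiad.wmnet
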
[09:43:49] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105794 (10dcaro) Insisting a bit on starting `ceph-osd@65` seemed to get it up and running, maybe there's some "start timeout"?
[09:46:29] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105799 (10dcaro) first try to start `ceph-osd@66` failed with the same error: ` Aug 21 09:44:47 cloudcephosd1004 ceph-osd[168404]: ceph-osd: ./src/osd/PeeringState...
[09:49:55] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105806 (10dcaro) hmm... I wonder if it has some cache of the `cloudcephosd1042` in the old `v14` version, and when checking the `check_prior_readable_down_osds` it...
[09:54:28] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105822 (10dcaro) I have tried extending the systemd unit start timeout to 5 min, see if that helps, though I think it's not getting to the 1m30s default :fingerscr...
[09:54:57] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105824 (10dcaro) Still failing, the time it takes is `Aug 21 09:54:26 cloudcephosd1004 systemd[1]: ceph-osd@66.service: Consumed 57.214s CPU time.`
[09:58:49] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:02:16] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105839 (10dcaro) Hmm.... before crashing, it starts checking old peers: ` Aug 21 09:55:36 cloudcephosd1004 ceph-osd[173450]: 2025-08-21T09:55:36.717+0000 7fcf28c85...
[10:02:56] FIRING: SystemdUnitDown: The service unit ceph-osd@66.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
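
The start-timeout experiment at 09:54:28 would normally be done with a systemd drop-in; something along these lines (a sketch of the standard override mechanism, not the exact change that was made):

    # create a drop-in raising the start timeout for the failing OSD unit
    systemctl edit ceph-osd@66.service
    # in the editor, add:
    #   [Service]
    #   TimeoutStartSec=5min
    # then retry the unit
    systemctl restart ceph-osd@66.service
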
[10:06:31] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[10:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[10:14:25] (03open) 10fnegri: Use bookworm image [repos/cloud/cloud-vps/tf-infra-test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6
[10:15:14] (03update) 10fnegri: Use bookworm image [repos/cloud/cloud-vps/tf-infra-test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6
[10:19:14] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[10:27:54] (03approved) 10filippo: Use bookworm image [repos/cloud/cloud-vps/tf-infra-test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6 (owner: 10fnegri)
[10:28:35] (03merge) 10fnegri: Use bookworm image [repos/cloud/cloud-vps/tf-infra-test] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6
[10:28:49] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:33:22] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105869 (10dcaro) ceph-osd@67 came up ok
[10:33:49] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:34:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-39 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:36:15] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[10:36:39] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105883 (10dcaro) ceph-osd@68 came up ok
[10:38:49] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[10:39:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-39 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:39:44] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105886 (10dcaro) ceph-osd@69 came up ok too, only 66 is left down
[10:40:54] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11105888 (10dcaro)
[10:50:40] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[10:52:13] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-36 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:52:58] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[10:56:58] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-36 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:35:50] (03approved) 10dcaro: jobs-api: bump to 0.0.407-20250821080315-aa0d79f8 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/933 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[11:35:52] (03merge) 10dcaro: jobs-api: bump to 0.0.407-20250821080315-aa0d79f8 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/933 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[11:37:56] RESOLVED: SystemdUnitDown: The service unit ceph-osd@66.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:45:56] FIRING: SystemdUnitDown: The service unit ceph-osd@66.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:16:39] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11106089 (10dcaro) p:05Triage→03High
[12:17:00] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T402499)
[12:17:08] T402499: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499
[12:23:49] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:26:20] (03update) 10l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - 10https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/42
[12:26:36] (03open) 10l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - 10https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/11
[12:28:43] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all NFS workers
[12:43:31] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe)
[12:43:49] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:50:12] (03PS1) 10Ayounsi: Add mock homer password [labs/private] - 10https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511)
[12:50:15] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T402499)
[12:50:23] T402499: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499
[12:51:59] (03PS2) 10Ayounsi: Add mock homer password [labs/private] - 10https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511)
[12:53:01] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [cookbook,ceph] depool_and_destroy ceph cookbook failed to destroy a single osd - https://phabricator.wikimedia.org/T402515 (10dcaro) 03NEW
[12:56:57] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499#11106226 (10dcaro) Manually zapping /dev/sdb on cloudcephosd1004, as the depool_and_destroy cookbook did not do it (see {T402515}): ` root@cloudcephosd1004:~# ls -la...
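
The manual zap at 12:56:57 is cut off in the log; the usual ceph-volume invocation for wiping an OSD data device looks like this (a sketch of the standard command with the device from the log; --destroy also removes the LVM metadata):

    # on cloudcephosd1004: wipe the old OSD data and its LVM volumes
    ceph-volume lvm zap /dev/sdb --destroy
    # afterwards the device should carry no PV/VG state
    pvs | grep sdb || echo "no PV on /dev/sdb"
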
[12:57:52] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T402499)
[12:57:59] T402499: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499
[12:58:28] !log dcaro@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T402499)
[12:59:56] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [cookbook,ceph] bootstrap_and_add ceph cookbook failed to add a new single osd 66 on host cloudcephosd1004 - https://phabricator.wikimedia.org/T402516 (10dcaro) 03NEW
[13:00:43] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS: [cookbook,ceph] bootstrap_and_add ceph cookbook failed to add a new single osd 66 on host cloudcephosd1004 - https://phabricator.wikimedia.org/T402516#11106255 (10dcaro) Note that the osd was actually added and it's getting data in, but it did not clear the os...
[13:12:26] RESOLVED: SystemdUnitDown: The service unit ceph-osd@66.service is in failed status on host cloudcephosd1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:13:47] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Catalyst: Quota increase request for catalyst-dev - https://phabricator.wikimedia.org/T402521 (10jnuche) 03NEW
[13:22:34] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Catalyst: Quota increase request for catalyst-dev - https://phabricator.wikimedia.org/T402521#11106391 (10taavi) Why does the staging environment need to match the production one in resources? I would assume staging would be much less heavily used than t...
[13:28:52] 06cloud-services-team, 10Cloud-VPS (Quota-requests), 10Catalyst: Quota increase request for catalyst-dev - https://phabricator.wikimedia.org/T402521#11106407 (10jnuche) >>! In T402521#11106391, @taavi wrote: > Why does the staging environment need to match the production one in resources? I would assume stag...
[14:19:45] (03PS1) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:23:15] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[14:28:07] (03PS2) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:32:23] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[14:34:21] (03PS3) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:35:42] (03PS4) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:35:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:36:02] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[14:37:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:38:13] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[14:38:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:38:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[14:39:34] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[14:39:48] (03PS5) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:39:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:39:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[14:41:34] (03PS6) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:41:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:41:43] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[14:43:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:44:44] (03PS7) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:44:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:47:12] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[14:48:42] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[14:49:21] (03PS8) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[14:50:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:52:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[14:53:35] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[14:58:11] (03PS9) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[15:01:59] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
[15:06:25] 10Tools: Update paste.toolforge.org to Stikked 0.12.0 - https://phabricator.wikimedia.org/T189256#11106938 (10bd808) 05Open→03Resolved a:03bd808 The tool is up to date with the latest upstream source, and has been for quite a while at this point. Configuration seems to have been last touched in late 20...
[15:09:49] 06cloud-services-team, 10Cloud-VPS, 10Toolforge: Block web crawlers from accessing Cloud Services - https://phabricator.wikimedia.org/T226688#11106950 (10Alien333) Note: XTools is now using Anubis in production, and it's worked well. (See conclusions at T400229.)
[15:29:53] 06cloud-services-team, 10Toolforge (Toolforge iteration 23): [k8s,infra] Upgrade tools to Kubernetes 1.30 - https://phabricator.wikimedia.org/T402378#11107106 (10dcaro) 05Open→03In progress
[15:29:58] (03PS10) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878
[15:29:59] 06cloud-services-team, 10Toolforge (Toolforge iteration 23): [k8s,infra] Upgrade toolsbeta to Kubernetes 1.30 - https://phabricator.wikimedia.org/T402377#11107110 (10dcaro) 05Open→03In progress
[15:30:32] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [components-api] support port protocol in config - https://phabricator.wikimedia.org/T401994#11107112 (10dcaro) p:05Triage→03Medium a:03Raymond_Ndibe
[15:30:40] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [components-api] support port protocol in config - https://phabricator.wikimedia.org/T401994#11107115 (10dcaro) 05Open→03In progress
[15:31:04] 10Toolforge (Toolforge iteration 23): [builds-api] Allow queuing builds - https://phabricator.wikimedia.org/T401894#11107134 (10dcaro) p:05Triage→03Medium
[15:31:18] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [components-api] Allow reusing another component build - https://phabricator.wikimedia.org/T401893#11107135 (10dcaro) p:05Triage→03High
[15:31:23] 10Toolforge (Toolforge iteration 23): [components-api] bump the openapi version on every change - https://phabricator.wikimedia.org/T401374#11107136 (10dcaro) p:05Triage→03Medium
[15:31:40] 10Toolforge (Toolforge iteration 23): [components-api,beta] Image should only be build once when re-used in components - https://phabricator.wikimedia.org/T401851#11107138 (10dcaro) p:05Triage→03High
[15:32:02] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [components-api] exclude defaults when getting deployment - https://phabricator.wikimedia.org/T401648#11107140 (10dcaro) p:05Triage→03Medium
[15:33:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[15:34:09] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott)
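
The gerrit change iterated on above (r/1180878) adds a ceph package version check before bootstrap; the patch body isn't shown in the log, but the comparison it implies can be sketched in shell (a hypothetical helper, assuming the intent is to refuse bootstrapping an OSD whose installed ceph differs from what the cluster runs):

    # version of the ceph packages on the new OSD host
    new_ver=$(ceph --version | awk '{print $3}')
    # version(s) reported by the running cluster daemons
    cluster_ver=$(ceph versions | grep -oP 'ceph version \K[0-9.]+' | sort -u | head -1)
    if [ "$new_ver" != "$cluster_ver" ]; then
        echo "refusing to bootstrap: host has ceph $new_ver, cluster runs $cluster_ver" >&2
        exit 1
    fi
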
10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott) [15:35:27] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) [15:35:49] (03PS11) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 [15:39:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:40:11] (03CR) 10CI reject: [V:04-1] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott) [15:42:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) [15:45:23] (03PS12) 10Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 [15:49:03] 06cloud-services-team, 10Data-Services: Create views for DiscussionTools items tables - https://phabricator.wikimedia.org/T374584#11107277 (10Pppery) 05Resolved→03Open a:05Ladsgroup→03None This was reverted due to {T400420} [15:55:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:55:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [15:57:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:57:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:58:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:58:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [15:59:13] (03CR) 10David Caro: [C:03+1] "LGTM, did not test it though (let me know if you want me to)." [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1180878 (owner: 10Andrew Bogott) [15:59:47] 06cloud-services-team, 10Data-Services, 06DBA, 10DiscussionTools, and 5 others: Deleted data available in DiscussionTools tables - https://phabricator.wikimedia.org/T400420#11107368 (10sbassett) p:05Triage→03Medium [16:01:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [16:01:56] 10Cloud-VPS (Project-requests): Request creation of eseap VPS project - https://phabricator.wikimedia.org/T401957#11107407 (10taavi) We talked about this in the WMCS team meeting. We encourage you to take advantage of existing resources, but if you are aware of the extra effort and [[ https://phabricator.wikimed... [16:02:18] 06cloud-services-team, 10Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11107408 (10Sakretsu) @fnegri no I'm not, but what should I do if this happens again? Do you want me to run some other commands? I may also run a new job with a different name and... 
[16:04:25] PROBLEM - Host cloudcephosd1042 is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:37] (CR) Andrew Bogott: ceph osds: check ceph package version before bootstrap (2 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1180878 (owner: Andrew Bogott)
[16:04:47] (PS13) Andrew Bogott: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1180878
[16:05:55] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11107433 (dcaro) If you have a reproducer would be great, if not, yes, if it happens again and you can leave it for us to inspect would be great, otherwise something like `kubect...
[16:06:54] RECOVERY - Host cloudcephosd1042 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[16:08:08] cloud-services-team, Toolforge: [Build service] latest builder has old PHP - https://phabricator.wikimedia.org/T401875#11107448 (dcaro) I think we can try updating the 'latest-versions' builder image to add that support, I still have to do a full battery of tests and such.
[16:19:56] FIRING: SystemdUnitDown: The service unit wmf_auto_restart_systemd-timesyncd.service is in failed status on host cloudnet1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:21:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:22:37] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11107572 (Sakretsu) Alright, I'll try leaving it in the namespace until you manage to inspect it then. I think this task can be closed for now. I'll reopen it if the issue reoccu...
[16:23:25] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11107574 (fnegri) Open→Resolved a:fnegri
[16:42:48] FIRING: PuppetFailure: Puppet has failed on cloudnet1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:42:55] cloud-services-team: PuppetFailure Puppet has failed on cloudnet1005:9100 - https://phabricator.wikimedia.org/T402561 (phaultfinder) NEW
[16:57:49] FIRING: [2x] PuppetFailure: Puppet has failed on cloudnet1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:57:58] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T402562 (phaultfinder) NEW
[17:00:35] cloud-services-team, Toolforge (Toolforge iteration 23): [jobs-api] make job status an enum, with clearly defined states - https://phabricator.wikimedia.org/T401172#11107754 (dcaro) After a live discussion, we agreed to the above with the following minor changes: Have a single `status` with the whole st...
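dcaro's summary of the agreed status model in that last comment is cut off in the log. Purely as a sketch of the general shape of T401172 ("make job status an enum, with clearly defined states"), with state names that are assumptions rather than the actual jobs-api set:

```
# Sketch of a job status enum with clearly defined states; the concrete
# names here are illustrative, not the actual jobs-api state set.
import enum


class JobStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"


def parse_status(raw: str) -> JobStatus:
    """Map free-form backend text onto the enum instead of leaking it to clients."""
    try:
        return JobStatus(raw.strip().lower())
    except ValueError:
        return JobStatus.UNKNOWN


assert parse_status("Running") is JobStatus.RUNNING
assert parse_status("CrashLoopBackOff") is JobStatus.UNKNOWN
```

An enum plus a defensive parser lets API clients match exhaustively on states rather than string-matching whatever the backend emits.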
[17:05:45] Toolforge (Toolforge iteration 23): [components-api,beta] Image should only be built once when re-used in components - https://phabricator.wikimedia.org/T401851#11107800 (dcaro) A simpler option is also doing the queueing on the components-api side, that's probably easier too right now (and does not prevent...
[17:06:47] Toolforge (Toolforge iteration 23): [components-api] Queue builds when the build queue is full - https://phabricator.wikimedia.org/T402568 (dcaro) NEW
[17:08:00] Toolforge (Toolforge iteration 23): [components-api] Queue builds when the build queue is full - https://phabricator.wikimedia.org/T402568#11107818 (dcaro) p:Triage→High
[17:17:14] Toolforge (Toolforge iteration 23): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569 (dcaro) NEW
[17:20:06] Toolforge (Toolforge iteration 23): [components-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402572 (dcaro) NEW
[17:33:08] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T402475#11107962 (Andrew) The partman recipe for those hosts is pretty minimal: it is puppet/modules/install_server/files/autoinstall/partman/hwraid-1dev-nvme.cfg (written by @elukey, subscribed). It is very simple and doesn't attempt to...
[17:37:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudnet1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:48:45] (open) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[18:14:56] FIRING: SystemdUnitDown: The systemd unit wmf_auto_restart_systemd-timesyncd.service on node cloudnet1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:15:06] cloud-services-team: SystemdUnitDown The systemd unit wmf_auto_restart_systemd-timesyncd.service on node cloudnet1005 has been failing for more than two hours. - https://phabricator.wikimedia.org/T402575 (phaultfinder) NEW
[18:19:06] (update) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[18:22:48] RESOLVED: PuppetFailure: Puppet has failed on cloudnet1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:30:20] (update) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[18:54:56] RESOLVED: SystemdUnitDown: The service unit wmf_auto_restart_systemd-timesyncd.service is in failed status on host cloudnet1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
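The two "handle non-passed arguments and defaults consistently" tasks logged above at 17:17 and 17:20 (T402569, T402572) name a classic API pitfall: an omitted field and a field explicitly set to None or to its default must be distinguishable. A minimal sketch of the usual sentinel pattern in Python; the function and field names are illustrative, not actual jobs-api or components-api code.

```
# Sketch of the sentinel pattern for telling "argument not passed" apart
# from "argument explicitly set to None/its default".
_UNSET = object()  # unique sentinel; no caller-supplied value can collide with it


def update_job(name: str, cpu=_UNSET, memory=_UNSET) -> dict:
    """Build a patch containing only the fields the caller actually passed."""
    patch = {}
    if cpu is not _UNSET:
        patch["cpu"] = cpu
    if memory is not _UNSET:
        patch["memory"] = memory
    return {"name": name, "patch": patch}


assert update_job("migrate")["patch"] == {}                       # nothing passed
assert update_job("migrate", cpu=None)["patch"] == {"cpu": None}  # explicit None kept
```

Applying one such convention across both APIs is what makes "exclude defaults when getting deployment" (T401648, triaged earlier in the day) tractable.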
[18:54:56] RESOLVED: SystemdUnitDown: The systemd unit wmf_auto_restart_systemd-timesyncd.service on node cloudnet1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:08:03] (CR) Andrew Bogott: [C:+2] ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1180878 (owner: Andrew Bogott)
[19:12:07] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): wikilink: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402055#11108386 (Scardenasmolinar) Open→In progress a:jsn.sherman
[19:12:07] (Merged) jenkins-bot: ceph osds: check ceph package version before bootstrap [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1180878 (owner: Andrew Bogott)
[19:12:25] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): wikilink: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402055#11108391 (Scardenasmolinar) Noting that the migration is happening right now and some downtime is expected
[19:21:58] VPS-project-Codesearch: Codesearch down/unreachable (2025-08-21) - https://phabricator.wikimedia.org/T402583 (sbassett) NEW
[19:25:42] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/11 (owner: l10n-bot)
[19:25:44] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/11 (owner: l10n-bot)
[19:57:29] VPS-project-Codesearch: Codesearch down/unreachable (2025-08-21) - https://phabricator.wikimedia.org/T402583#11108519 (Dzahn) Looks like it has been fixed. Works for me (now).
[19:59:20] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Moderator-Tools-Team (Kanban): wikilink: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402055#11108536 (jsn.sherman) In progress→Stalled We need to request a quota increase to complete the migra...
[20:01:16] VPS-project-Codesearch: Codesearch down/unreachable (2025-08-21) - https://phabricator.wikimedia.org/T402583#11108549 (sbassett) Open→Resolved p:Triage→Low a:sbassett >>! In T402583#11108519, @Dzahn wrote: > Looks like it has been fixed. Works for me (now). Hooray!
[20:01:35] VPS-project-Codesearch: Codesearch down/unreachable (2025-08-21) - https://phabricator.wikimedia.org/T402583#11108571 (sbassett) a:sbassett→None
[20:04:08] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Epic, Moderator-Tools-Team (Kanban): hashtags: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402056#11108591 (jsn.sherman) Open→In progress a:jsn.sherman Upon getting stalled in {T402055...
[20:07:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:12:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:22:56] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:27:21] Tools: Request @framawiki access to spamadmin feature of paste.toolforge.org tool - https://phabricator.wikimedia.org/T189257#11109022 (Framawiki) Open→Invalid
[22:40:13] (PS1) Jacob4code: add seamless multiple searches without needing to refresh or remount the component [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1180997
[22:50:35] (PS1) Jacob4code: Clear duplicate that the previous commit created. [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1181000
[22:54:31] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1181000 (owner: Jacob4code)
[23:11:51] (approved) eugene233: Update partner logos to be clickable links with target attributes [toolforge-repos/isa] - https://gitlab.wikimedia.org/toolforge-repos/isa/-/merge_requests/11 (owner: gopavasanth)
[23:15:43] (merge) eugene233: Update partner logos to be clickable links with target attributes [toolforge-repos/isa] - https://gitlab.wikimedia.org/toolforge-repos/isa/-/merge_requests/11 (owner: gopavasanth)
[23:18:18] (approved) eugene233: Fix:Long loading times for campaign landing page [toolforge-repos/isa] - https://gitlab.wikimedia.org/toolforge-repos/isa/-/merge_requests/10 (owner: swayamagrahari)
[23:19:01] (merge) eugene233: Fix:Long loading times for campaign landing page [toolforge-repos/isa] - https://gitlab.wikimedia.org/toolforge-repos/isa/-/merge_requests/10 (owner: swayamagrahari)
[23:44:34] (PS2) Jacob4code: add seamless multiple searches without needing to refresh or remount the component [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1180997 (https://phabricator.wikimedia.org/T397019)