[04:05:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T401693)
[04:05:36] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[04:06:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[04:10:06] PROBLEM - Host cloudcephosd1047 is DOWN: PING CRITICAL - Packet loss = 100%
[04:12:36] RECOVERY - Host cloudcephosd1047 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[04:13:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:23:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:29:46] FIRING: Primary cloud switch port utilisation over 80%: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+port+utilisation+over+80%25
[04:29:46] FIRING: Primary cloud switch inbound port utilisation over 80%: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+inbound+port+utilisation+over+80%25
[04:29:51] cloud-services-team: Primary cloud switch port utilisation over 80% Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://phabricator.wikimedia.org/T402657#11113660 (phaultfinder)
[04:29:54] cloud-services-team: Primary cloud switch inbound port utilisation over 80% Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://phabricator.wikimedia.org/T402758 (phaultfinder) NEW
[04:34:46] RESOLVED: Primary cloud switch port utilisation over 80%: Device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet recovered from Primary cloud switch port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+port+utilisation+over+80%25
[04:34:46] RESOLVED: Primary cloud switch inbound port utilisation over 80%: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary cloud switch inbound port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+inbound+port+utilisation+over+80%25
[07:59:06] (PS1) Muehlenhoff: Add dummy keytabs for new install servers T396487 [labs/private] - https://gerrit.wikimedia.org/r/1181638
[08:09:02] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api,beta] Config not updated from remote source - https://phabricator.wikimedia.org/T401868#11113851 (dcaro) > Components support source_repo / source_path (maybe source_branch) in addition to source_url, which explicitly resolves the l...
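For the CephClusterInWarning window above (new cloudcephosd OSDs being bootstrapped while the cluster rebalances), the first look is usually the stock Ceph CLI on a mon host; a minimal sketch, assuming shell access to a cloudcephmon node (host choice and sudo policy are assumptions, the commands themselves are standard Ceph):

    # Cluster-wide status: health flag, recovery/backfill progress, degraded PG counts
    sudo ceph -s
    # Expand the WARN into its specific causes
    sudo ceph health detail
    # Confirm freshly added OSDs (e.g. on cloudcephosd1047) are up and in the CRUSH tree
    sudo ceph osd tree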
[08:23:38] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764 (dcaro) NEW
[08:32:50] cloud-services-team, Cloud-VPS: Monitoring/metrics for trove instances - https://phabricator.wikimedia.org/T402738#11113905 (dcaro) This is related to {T354728}, and potentially we can add the trove metrics to the common CloudVPS project graphs for everyone to access (https://grafana-rw.wmcloud.org/dashb...
[08:33:13] cloud-services-team, Cloud-VPS: Monitoring/metrics for trove instances - https://phabricator.wikimedia.org/T402738#11113907 (dcaro) p:Triage→Medium
[08:33:26] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764#11113908 (dcaro) p:Triage→High
[08:40:44] (CR) Muehlenhoff: [V:+2 C:+2] Add dummy keytabs for new install servers T396487 [labs/private] - https://gerrit.wikimedia.org/r/1181638 (owner: Muehlenhoff)
[08:53:38] cloud-services-team, Toolforge: [lima-kilo] Improve convergence - https://phabricator.wikimedia.org/T402672#11114052 (dcaro) Ansible has many shortcomings when trying to make it re-entrant, essentially you have to implement most if not all the logic yourself. We had some of that code in lima-kilo in the...
[08:56:34] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:46] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:54] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:58] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:42] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:58] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:58:58] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:18] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:34] PROBLEM - nova-compute proc maximum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:42] PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:58] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:22] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:52] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:54] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:56] FIRING: [11x] SystemdUnitDown: The service unit libvirtd-admin.socket is in failed status on host cloudvirt1070. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:03:52] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:04:42] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:04:54] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:05:42] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:08:54] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:18] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:18] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:18] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:22] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:22] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:23] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:23] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:24] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:25] PROBLEM - nova-compute proc minimum on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:26] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:27] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:28] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:29] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:30] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:31] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:32] PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:33] PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:34] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:35] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:36] RECOVERY - nova-compute proc maximum on cloudvirt1070 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:37] PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:38] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:39] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:42] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:42] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:42] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:43] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:44] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:46] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:54] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:54] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:58] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:58] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:59] PROBLEM - nova-compute proc minimum on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:02] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:30] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:22] RECOVERY - nova-compute proc minimum on cloudvirt1076 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:26] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:26] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:28] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:28] RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:34] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:35] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:35] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:42] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:54] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:58] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:58] RECOVERY - nova-compute proc minimum on cloudvirt1072 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:02] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:18] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:18] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:22] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:22] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:23] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:23] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:24] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:26] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:28] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:28] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:34] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:42] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:42] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:14:46] PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:14:54] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:16:44] cloud-services-team, Cloud-VPS, Patch-For-Review: Use cloud-private network and cfssl certs for instance live migrations - https://phabricator.wikimedia.org/T355145#11114127 (fgiunchedi)
[09:18:00] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:26] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:46] RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:54] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:55] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:21:51] cloud-services-team, Cloud-VPS, Patch-For-Review: Use cloud-private network and cfssl certs for instance live migrations - https://phabricator.wikimedia.org/T355145#11114134 (fgiunchedi) A little bumpy since `nova-compute` and `libvirtd` were down during the first puppet run, and `nova-compute` down...
[09:27:26] RESOLVED: [43x] SystemdUnitDown: The service unit libvirtd-admin.socket is in failed status on host cloudvirt1051. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:32:47] cloud-services-team, Cloud-VPS: Evaluate higher level signals for nova troubles rather than paging on nova-compute down - https://phabricator.wikimedia.org/T402778 (fgiunchedi) NEW
[09:33:43] (update) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[10:01:17] (update) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[11:25:40] (update) dcaro: openapi: add the internal server and some description [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/76 (https://phabricator.wikimedia.org/T402032)
[11:26:34] Toolforge (Toolforge iteration 23), Patch-For-Review: [components-api] store the config used for the deployment in the deployment themselves - https://phabricator.wikimedia.org/T400064#11114435 (dcaro) In progress→Resolved
[11:27:40] cloud-services-team, Toolforge (Toolforge iteration 23), Patch-For-Review: https://api.svc.toolforge.org endpoint given in OpenAPI spec returns 403 forbidden errors - https://phabricator.wikimedia.org/T402032#11114436 (dcaro) a:dcaro
[11:27:45] cloud-services-team, Toolforge (Toolforge iteration 23), Patch-For-Review: https://api.svc.toolforge.org endpoint given in OpenAPI spec returns 403 forbidden errors - https://phabricator.wikimedia.org/T402032#11114441 (dcaro) Open→In progress
[11:28:44] (update) damian: kubectl alias - use blockinfile [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/262
[11:29:03] (update) damian: install-binary-from-url - add checksums for dest [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/263
[11:29:15] (update) damian: harbor - only download and setup once [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/265
[11:29:21] (update) damian: harbor - move restart to handler [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/267
[11:29:50] (update) damian: docker - move restart to handler [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/264
[11:29:59] (update) damian: tool home dir - update permissions [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/268
[11:30:10] (update) damian: deploy components - don't report as changed [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/266
[11:31:40] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787 (fnegri) NEW
[11:32:25] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114465 (Ladsgroup) We have a patch for it ready even!
[11:33:12] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114467 (Ladsgroup) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178899 My plan was to merge this this we...
[11:33:19] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114468 (fnegri) Nice, I missed that! :)
[11:34:19] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114471 (fnegri) That should be fine, with the email to cloud-announce you already planned.
[11:35:19] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114473 (fnegri)
[11:36:01] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114475 (Ladsgroup) >>! In T402787#11114471, @fnegri wrote: > That should be fine, with the email to cloud-announce...
[11:38:31] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Patch-For-Review: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114479 (Zabe) >>! In T402787#11114475, @Ladsgroup wrote: >>>! In T402787#11114471, @fnegri w...
[11:38:35] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Patch-For-Review: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114480 (fnegri) > I've sent that already last week, didn't I? With corresponding tech news e...
[11:39:30] Toolforge (Toolforge iteration 23): [components-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402572#11114482 (dcaro)
[11:40:04] Toolforge (Toolforge iteration 23): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569#11114484 (dcaro)
[11:41:29] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Patch-For-Review: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114486 (Ladsgroup) I'm actually waiting for this issue of tech news to reach people (later t...
[11:42:07] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Patch-For-Review: [wikireplicas] Remove rc_new from recentchanges view definitions - https://phabricator.wikimedia.org/T402787#11114487 (fnegri) Sounds good! Sorry for the noise, I saw that {T36320} was resolved, so I th...
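The 08:56-09:18 nova-compute PROBLEM/RECOVERY storm above is a plain process-count check: it scans the process table for the nova-compute command line, and the `pytho[n]` character class is the classic trick so that a `ps | grep` style check cannot match its own command line (pgrep excludes itself anyway; the pattern below is kept verbatim from the alert text). A minimal shell equivalent for spot-checking one cloudvirt by hand, assuming shell access to the host:

    # Count nova-compute processes the same way the check does
    pgrep -fc '^/usr/bin/pytho[n].* /usr/bin/nova-compute'
    # The systemd view of the same service
    systemctl status nova-compute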
[11:43:58] Toolforge (Toolforge iteration 23): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569#11114493 (dcaro)
[11:44:20] Toolforge (Toolforge iteration 23): [components-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402572#11114494 (dcaro)
[11:44:41] Toolforge (Toolforge iteration 23): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569#11114496 (dcaro) p:Triage→High a:dcaro
[11:45:03] Toolforge (Toolforge iteration 23): [components-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402572#11114498 (dcaro)
[11:45:13] Toolforge (Toolforge iteration 23): [components-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402572#11114501 (dcaro) p:Triage→Medium
[11:46:41] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api,beta] Config not updated from remote source - https://phabricator.wikimedia.org/T401868#11114502 (DamianZaremba) >>! In T401868#11113851, @dcaro wrote: >> Components support source_repo / source_path (maybe source_branch) in additio...
[11:51:34] (open) arthurtaylor: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504)
[11:55:26] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764#11114538 (DamianZaremba) It would be a breaking change, but perhaps: ` source: url: ` ` source: repo_url: branch: main ` That...
[11:57:24] (update) arthurtaylor: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504)
[12:01:55] (update) arthurtaylor: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504)
[12:06:47] cloud-services-team, Toolforge: [components-api] split source from config - https://phabricator.wikimedia.org/T402790 (DamianZaremba) NEW
[12:07:51] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764#11114583 (DamianZaremba) I made https://phabricator.wikimedia.org/T402790 as it's not directly related to this, but implementation of t...
[12:35:42] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/43
[12:35:43] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/12
[12:53:09] (open) vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21
[13:05:09] cloud-services-team, Cloud-VPS, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799 (Andrew) NEW
[13:05:30] cloud-services-team, Cloud-VPS, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114773 (Andrew)
[13:05:36] cloud-services-team, Cloud-VPS, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114774 (Andrew) p:Triage→Medium
[13:07:10] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11114777 (Andrew)
[13:08:27] cloud-services-team, Cloud-VPS, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114780 (TheDJ) @Chippyy can you check why warper data and your home directory have this much data stored ?
[13:09:53] cloud-services-team, Cloud-VPS, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114784 (TheDJ) I also think we can delete the tiles directory, as we no longer run a tiles server as before in that group. Do you agree @dschwen ?
[13:16:53] Toolforge (Toolforge iteration 23): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569#11114812 (dcaro) Open→In progress
[13:19:00] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114814 (taavi)
[13:19:36] (open) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[13:20:54] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764#11114822 (dcaro) +1 for both, though being non-backwards compatible we will have to support both syntaxes for a while
[13:36:54] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T401693)
[13:37:02] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[13:40:09] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114906 (Chippyy) >>! In T402799#11114779, @TheDJ wrote: > @Chippyy can you check why warper data and your home directory have this much data stored ? /home/warperdata is the main storage for the Wikimaps...
[13:48:41] cloud-services-team, Huggle: huggle-nfs volume filling up - https://phabricator.wikimedia.org/T402806 (Andrew) NEW
[13:50:40] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114963 (dschwen) I don't see 1.4GB in my home dir when I log onto maps-wma2. The big chunk on nfs are map tiles for the WikiMiniAtlas
[13:51:41] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11114966 (dschwen) ` dschwen@maps-wma2:/mnt/nfs/secondary-maps/home/dschwen$ du -sch * 4.0K README.check_apaches 54M apache_heartbeat.log 4.0K apache_heartbeat.sh 13M bin 7.3M git 20K hosts 4.0K install.sh 4...
[13:54:13] cloud-services-team: wikidumpparse NFS volume filling up - https://phabricator.wikimedia.org/T402807 (Andrew) NEW
[13:59:27] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115023 (Andrew)
[14:14:41] (open) damian: Draft: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121
[14:14:49] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115071 (Andrew) ` root@maps-wma2:/home/dschwen# du -h -d1 . 8.0K ./.config 7.3M ./git 13M ./bin 8.0K ./.gnupg 57M ./.cache 772M ./.vscode-server 36K ./.ssh 223M ./.local 4.0K ./.nano 16K ./.myconfig 1.1G...
[14:16:23] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115091 (TheDJ) > /home/warperdata is the main storage for the Wikimaps Warper application: https://warper.wmflabs.org/ FYI: I don't think we should have 1.8TB in a home directory... this is what we have...
[14:16:40] (update) damian: Draft: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121
[14:18:00] (update) damian: Draft: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121
[14:18:03] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115098 (dschwen) D'Uh. Ok, I can delete vscode server, but it'll be redownloaded when I connect again.
[14:18:18] (update) damian: Draft: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121
[14:21:15] (update) lucaswerkmeister-wmde: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504) (owner: arthurtaylor)
[14:21:25] (update) lucaswerkmeister-wmde: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504) (owner: arthurtaylor)
[14:22:29] (approved) lucaswerkmeister-wmde: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504) (owner: arthurtaylor)
[14:25:12] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115134 (Andrew) side-note: this server is due for a rebuild on Trixie. If you wind up doing anything that requires scheduled downtime let me know and we can do the rebuild at the same time
[14:28:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[14:32:04] cloud-services-team, Toolforge (Toolforge iteration 23): [components-api] allow specifying `source_repo`+`ref` for the config - https://phabricator.wikimedia.org/T402764#11115161 (DamianZaremba) First stab at this: https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121 (v...
[14:33:59] cloud-services-team, VPS-Projects: wikidumpparse NFS volume filling up - https://phabricator.wikimedia.org/T402807#11115178 (taavi)
[14:34:37] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115183 (TheDJ) >>! In T402799#11115098, @dschwen wrote: > D'Uh. Ok, I can delete vscode server, but it'll be redownloaded when I connect again. Instead of worrying about 1.1GB, I suggest we delete: 4.1T...
[14:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:39:38] cloud-services-team, Maps: maps NFS volume filling up - https://phabricator.wikimedia.org/T402799#11115222 (taavi) >>! In T402799#11115098, @dschwen wrote: > D'Uh. Ok, I can delete vscode server, but it'll be redownloaded when I connect again. The [[ https://code.visualstudio.com/docs/remote/vscode-ser...
[14:39:56] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115223 (Andrew)
[14:54:58] (open) damian: README - drop --workers [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/122
[15:09:25] (approved) audreypenven: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504) (owner: arthurtaylor)
[15:10:36] (merge) audreypenven: Cache job timing information per class rather than per job [toolforge-repos/phpunit-results-cache] - https://gitlab.wikimedia.org/toolforge-repos/phpunit-results-cache/-/merge_requests/11 (https://phabricator.wikimedia.org/T402504) (owner: arthurtaylor)
[16:00:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:03:55] (open) fnegri: Setup pytest, add first test [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/4
[16:27:39] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[16:27:57] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[16:30:11] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Epic, Moderator-Tools-Team (Kanban): hashtags: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402056#11115842 (jsn.sherman) After significant cleanup, we won't need to request additional storage. I'...
[16:31:55] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[16:37:51] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[16:54:04] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[17:03:53] (update) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/43 (owner: l10n-bot)
[17:05:17] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/43 (owner: l10n-bot)
[17:05:21] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/43 (owner: l10n-bot)
[17:08:46] (update) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/12 (owner: l10n-bot)
[17:10:13] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/12 (owner: l10n-bot)
[17:10:17] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/12 (owner: l10n-bot)
[17:11:35] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[17:13:53] VPS-Projects, Content-Transform-Team (Work In Progress), Essential-Work: Request new VPS for Content Transform Team Visual Diff teating - https://phabricator.wikimedia.org/T402836 (cscott) NEW
[17:20:52] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[17:30:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[17:30:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[17:31:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[17:33:59] (open) dcaro: dump: skip unset keys [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/124
[17:34:16] (update) dcaro: api: add `include_unset` parameter to get_job and get_jobs [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/205 (https://phabricator.wikimedia.org/T402569)
[17:35:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 670 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[17:35:22] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 670 slow ops - https://phabricator.wikimedia.org/T402839 (phaultfinder) NEW
[17:35:22] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97)
[17:36:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[17:37:28] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[17:38:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[17:38:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[17:38:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-67 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:39:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[17:39:28] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:40:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[17:40:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[17:41:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:43:28] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-67 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:44:28] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-puppetdb-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:46:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:48:46] FIRING: Primary cloud switch port utilisation over 80%: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+port+utilisation+over+80%25
[17:48:50] cloud-services-team: Primary cloud switch port utilisation over 80% Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary cloud switch port utilisation over 80% - https://phabricator.wikimedia.org/T402657#11116181 (phaultfinder)
[17:50:00] FIRING: Primary cloud switch inbound port utilisation over 80%: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://alerts.wikimedia.org/?q=alertname%3DPrimary+cloud+switch+inbound+port+utilisation+over+80%25
[17:50:12] cloud-services-team: Primary cloud switch inbound port utilisation over 80% Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary cloud switch inbound port utilisation over 80% - https://phabricator.wikimedia.org/T402658#11116200 (phaultfinder)
[17:51:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:57:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of -1 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:57:56] FIRING: SystemdUnitDown: The service unit disable-tool.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:02:56] FIRING: HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[18:17:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 51 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[18:18:55] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[18:19:26] RESOLVED: SystemdUnitDown: The service unit disable-tool.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:19:26] RESOLVED: HarborComponentDown: A Harbor component is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[18:30:01] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[18:46:00] (update) vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21
[19:06:46] (update) vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21
[19:57:20] cloud-services-team, VPS-Projects: wikidumpparse NFS volume filling up - https://phabricator.wikimedia.org/T402807#11116695 (Peachey88)
[20:22:39] Cloud-VPS (Debian Bullseye Deprecation), The-Wikipedia-Library, Epic, Moderator-Tools-Team (Kanban): hashtags: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402056#11116779 (jsn.sherman) okay, new instance is up and running and the old instance is shut off; we'l...
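The SystemdUnitDown alerts in this stretch (libvirtd-admin.socket on the cloudvirts, disable-tool.service on cloudcontrol1007) are normally triaged with plain systemd tooling on the affected host; a minimal sketch, assuming shell access to e.g. cloudcontrol1007:

    # List everything systemd currently considers failed
    systemctl --failed
    # Why this unit failed, plus its recent log lines
    systemctl status disable-tool.service
    journalctl -u disable-tool.service -n 50 --no-pager
    # Once the underlying cause is fixed, clear the failed state so the alert can resolve
    sudo systemctl reset-failed disable-tool.service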
[21:07:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[21:07:22] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[21:14:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:17:49] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T401693)
[21:17:57] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[21:24:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:38:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:43:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:18:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-81 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:22:01] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-81
[22:28:06] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-81
[22:33:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-81 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:12:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:22:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on cloudweb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:37:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
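PuppetZeroResources fires when an agent run produces a catalog with (effectively) zero resources, which usually means the run itself is failing; the quickest check is an interactive agent run on the affected host. A minimal sketch, assuming shell access to one of the cloudweb hosts (the journal unit name is an assumption, since packaging differs):

    # Re-run the agent in the foreground and watch for catalog compilation errors
    sudo puppet agent --test
    # Recent agent logs; the unit may be named puppet or puppet-agent depending on packaging
    sudo journalctl -u puppet -n 50 --no-pager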