[00:11:27] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:11:31] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:16:43] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:19:03] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:40:23] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:44:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:10:37] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:27] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[01:11:59] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:12:13] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:13:31] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 123454 bytes in 0.370 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[01:16:23] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:18:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28883 and previous config saved to /var/cache/conftool/dbconfig/20220529-011812-ladsgroup.json
[01:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:21] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:18:51] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28884 and previous config saved to /var/cache/conftool/dbconfig/20220529-013317-ladsgroup.json
[01:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28885 and previous config saved to /var/cache/conftool/dbconfig/20220529-014822-ladsgroup.json
[01:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28886 and previous config saved to /var/cache/conftool/dbconfig/20220529-020327-ladsgroup.json
[02:03:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[02:03:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[02:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:35] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:37] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:12:53] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:18:03] (PS1) Andrew Bogott: Add dummy passwords for openstack magnum [labs/private] - https://gerrit.wikimedia.org/r/800866
[03:18:32] (CR) Andrew Bogott: [V: +2 C: +2] Add dummy passwords for openstack magnum [labs/private] - https://gerrit.wikimedia.org/r/800866 (owner: Andrew Bogott)
[03:20:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:29:51] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:30:43] (PS1) Andrew Bogott: haproxy: fix some copy/paste errors with heat frontend ports [puppet] - https://gerrit.wikimedia.org/r/800867
[03:30:45] (PS1) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:32:04] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:35:40] (CR) Andrew Bogott: [C: +2] haproxy: fix some copy/paste errors with heat frontend ports [puppet] - https://gerrit.wikimedia.org/r/800867 (owner: Andrew Bogott)
[03:36:20] (PS2) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:36:56] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:37:54] (PS3) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:38:30] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:39:22] (PS4) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[04:14:51] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:10:57] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:17:47] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:29] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:38:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:02:59] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:07:13] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:08:09] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:31] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:18:53] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:17] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220529T0700)
[07:03:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:30:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:11] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:40:39] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28887 and previous config saved to /var/cache/conftool/dbconfig/20220529-074122-ladsgroup.json
[07:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[07:48:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28888 and previous config saved to /var/cache/conftool/dbconfig/20220529-075627-ladsgroup.json
[07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:25] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:11] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28889 and previous config saved to /var/cache/conftool/dbconfig/20220529-081132-ladsgroup.json
[08:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:13] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:26:37] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28890 and previous config saved to /var/cache/conftool/dbconfig/20220529-082637-ladsgroup.json
[08:26:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:26:45] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[08:26:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 10 hosts with reason: Maintenance
[08:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 10 hosts with reason: Maintenance
[08:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:13] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:11:49] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:49] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:27:26] SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (RhinosF1) Resolved→Open a: Jclark-ctr→None Hey, analytics1068 has been alerting about mega raid. I'm guessing that's expected. Can someone please downtime it if so?
[11:55:53] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:31] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:14:13] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:24:01] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:29:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:29:40] (CR) Gergő Tisza: [C: -2] "The backport needs to include Ib34a84d1d3abbce4dcf7433b51abf6e694984c59." [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) (owner: Gergő Tisza)
[12:56:03] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:04:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28891 and previous config saved to /var/cache/conftool/dbconfig/20220529-130406-ladsgroup.json
[13:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:18] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[13:12:29] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:16:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:25:09] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:31:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:53] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:18] (ProbeDown) firing: Service zotero:4969 has failed probes (http_zotero_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:06:55] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:07:19] (ProbeDown) firing: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:07] here, checking
[14:09:15] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_4969: Servers kubernetes1008.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:10:06] I forgot we're paging on zotero heh
[14:10:12] godog: I acked the page
[14:10:25] sobanski: cheers!
[14:10:36] Let me know if I can help with anything else
[14:10:54] ack, checking logs
[14:10:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /api (Zotero and citoid alive) is WARNING: Test Zotero and citoid alive responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid
[14:11:00] * volans on mobile, not much I can do to help right now
[14:11:11] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_4969: Servers kubernetes1022.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:12:18] (ProbeDown) resolved: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:12:27] mmhh it is back
[14:13:00] can't say I understand what happened yet
[14:13:06] volans: thanks for showing up
[14:13:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:13:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:13:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:16:18] (ProbeDown) resolved: Service zotero:4969 has failed probes (http_zotero_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:15] still looking at k8s logs on logstash
[14:19:10] looks like zotero tried and failed to parse some documents, I'm guessing there was enough load to make all containers unavailable
[14:19:48] some zotero pods got depooled by kubernetes because readiness probes failed. pods are still being throttled quite a lot
[14:21:26] *nod* I'm looking at the logs brutally and literally searching for "error" but it doesn't seem a very new thing, similar errors occurred in the past https://logstash.wikimedia.org/goto/936f704b43ac50b4bdfad95f58df5797
[14:21:29] but all pods are ready now and readiness probes are happy
[14:22:52] I also got the error about parsing css in kubectl get logs
[14:23:07] cpu/mem does seem elevated from the dashboard here https://grafana.wikimedia.org/d/2oPtfvXWk/zotero?orgId=1&from=now-24h&to=now&refresh=1m
[14:23:50] though we had a similar spike a few hours ago, perhaps not big enough to trigger unavailability
[14:27:53] jelto: the only thing I can think of for now is to bump limit/quota a little in case this traffic and/or heavy parsing comes back, what do you think? or possibly even ride it out, not sure there's a whole lot we can do
[14:27:56] http 5XX are back to normal but were elevated during the page .. so it seems zotero is just answering slower
[14:28:09] when RSS explodes that is
[14:29:05] I assume we could scale from 14 replicas higher.. but this could cascade the problem somewhere else. I would say try to ride it out, as there is no elevated 5XX?
[14:29:43] btw I'm looking at envoy metrics here https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=zotero&var-destination=All&from=now-6h&to=now
[14:30:09] ack, thanks for the dashboard
[14:31:50] yeah +1 to ride it out, if it comes back we'll take other measures
[14:32:18] * godog back afk
[14:33:32] throttling seems to decrease .. if it comes back we could scale the zotero-production deployment in the eqiad kubernetes cluster
[14:44:52] zotero usage back to normal and no more throttling.. I'm going afk again
[14:48:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:48:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28892 and previous config saved to /var/cache/conftool/dbconfig/20220529-144839-ladsgroup.json
[14:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:47] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[14:54:23] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:33] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:10:44] !log cleanup stalled backups on gitlab1001, re-run full backup
[15:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:51:12] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:01:34] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:04:27] <_joe_> zotero shouldn't be paging
[16:15:58] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:06] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:48] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:20:24] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:35] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:37:09] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[16:37:39] that doesn't look good?
[16:39:05] _joe_ you around? ^
[16:43:53] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[16:54:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:55:38] (PS1) Majavah: P:openstack::designate: set base_url to use the https port [puppet] - https://gerrit.wikimedia.org/r/800948 (https://phabricator.wikimedia.org/T267194)
[16:55:40] (PS1) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[16:55:42] (PS1) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[16:55:44] (PS1) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[16:55:46] (PS1) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[16:55:48] (PS1) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[16:55:50] (PS1) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[16:55:52] (PS1) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[17:14:34] (PS2) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[17:14:37] (PS2) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[17:14:39] (PS2) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[17:14:41] (PS2) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[17:14:43] (PS2) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[17:14:45] (PS2) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[17:14:47] (PS2) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[17:17:17] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:17:39] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35612/console" [puppet] - https://gerrit.wikimedia.org/r/800951 (owner: Majavah)
[17:18:45] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:32:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:43:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28893 and previous config saved to /var/cache/conftool/dbconfig/20220529-184614-ladsgroup.json
[18:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:23] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[18:55:37] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:39] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28894 and previous config saved to /var/cache/conftool/dbconfig/20220529-190119-ladsgroup.json
[19:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:16:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28895 and previous config saved to /var/cache/conftool/dbconfig/20220529-191625-ladsgroup.json
[19:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:51] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:21:53] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28896 and previous config saved to /var/cache/conftool/dbconfig/20220529-193130-ladsgroup.json
[19:31:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:31:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:36] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[19:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28897 and previous config saved to /var/cache/conftool/dbconfig/20220529-193138-ladsgroup.json
[19:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:48] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:14:13] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:21] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:15:29] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:22:19] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:31] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:54:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28898 and previous config saved to /var/cache/conftool/dbconfig/20220529-215425-ladsgroup.json [21:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:33] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [21:57:11] (03PS1) 10Ladsgroup: Stop trying to pass legacy page_restrictions to 
RestrictionStore [extensions/LiquidThreads] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800705 (https://phabricator.wikimedia.org/T309460) [22:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28899 and previous config saved to /var/cache/conftool/dbconfig/20220529-220930-ladsgroup.json [22:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:23:35] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28900 and previous config saved to /var/cache/conftool/dbconfig/20220529-222435-ladsgroup.json [22:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:41] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28901 and previous config saved to /var/cache/conftool/dbconfig/20220529-223940-ladsgroup.json [22:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:48] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298560 [22:44:29] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:09:06] not sure which channel is appropriate for wikitech sysop requests, but could use some help there with that vandal [23:27:09] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
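The PROBLEM/RECOVERY bot lines above follow one fixed shape: `[HH:MM:SS] STATE - CHECK on HOST is STATUS: DETAIL …runbook URLs`. A minimal sketch of pulling structured fields out of such lines when skimming a log like this; the names `ALERT_RE` and `parse_alert` are illustrative only, not part of any WMF tooling, and everything after the status is treated as free-form detail.

```python
import re

# One named group per field of the icinga-wm alert line shape seen above,
# e.g. "[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: ...".
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<state>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
    r"(?P<status>OK|WARNING|CRITICAL|UNKNOWN): (?P<detail>.*)"
)

def parse_alert(line: str):
    """Return a dict of alert fields, or None for chat/!log/gerrit lines."""
    m = ALERT_RE.match(line)
    return m.groupdict() if m else None

line = ("[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: "
        "CRITICAL: 13 LD(s) must have write cache policy WriteBack")
alert = parse_alert(line)
```

Human chat and `!log` lines deliberately fall through to `None`, so the same loop can filter a mixed log down to monitoring alerts only.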