[00:11:27] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:11:31] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:16:43] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:19:03] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:40:23] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:44:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:10:37] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:27] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[01:11:59] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:12:13] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:13:31] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 123454 bytes in 0.370 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[01:16:23] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:18:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28883 and previous config saved to /var/cache/conftool/dbconfig/20220529-011812-ladsgroup.json
[01:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:21] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:18:51] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28884 and previous config saved to /var/cache/conftool/dbconfig/20220529-013317-ladsgroup.json
[01:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28885 and previous config saved to /var/cache/conftool/dbconfig/20220529-014822-ladsgroup.json
[01:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:03:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28886 and previous config saved to /var/cache/conftool/dbconfig/20220529-020327-ladsgroup.json
[02:03:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[02:03:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[02:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:35] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:37] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:12:53] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:18:03] (PS1) Andrew Bogott: Add dummy passwords for openstack magnum [labs/private] - https://gerrit.wikimedia.org/r/800866
[03:18:32] (CR) Andrew Bogott: [V: +2 C: +2] Add dummy passwords for openstack magnum [labs/private] - https://gerrit.wikimedia.org/r/800866 (owner: Andrew Bogott)
[03:20:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:29:51] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:30:43] (PS1) Andrew Bogott: haproxy: fix some copy/paste errors with heat frontend ports [puppet] - https://gerrit.wikimedia.org/r/800867
[03:30:45] (PS1) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:32:04] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:35:40] (CR) Andrew Bogott: [C: +2] haproxy: fix some copy/paste errors with heat frontend ports [puppet] - https://gerrit.wikimedia.org/r/800867 (owner: Andrew Bogott)
[03:36:20] (PS2) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:36:56] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:37:54] (PS3) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[03:38:30] (CR) CI reject: [V: -1] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:39:22] (PS4) Andrew Bogott: Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792)
[04:14:51] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:10:57] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:17:47] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:29] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:38:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:02:59] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:07:13] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:08:09] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:31] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:18:53] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:17] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220529T0700)
[07:03:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:30:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:11] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:40:39] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28887 and previous config saved to /var/cache/conftool/dbconfig/20220529-074122-ladsgroup.json
[07:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[07:48:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28888 and previous config saved to /var/cache/conftool/dbconfig/20220529-075627-ladsgroup.json
[07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:25] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:11] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28889 and previous config saved to /var/cache/conftool/dbconfig/20220529-081132-ladsgroup.json
[08:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:13] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:26:37] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28890 and previous config saved to /var/cache/conftool/dbconfig/20220529-082637-ladsgroup.json
[08:26:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:26:45] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[08:26:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 10 hosts with reason: Maintenance
[08:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 10 hosts with reason: Maintenance
[08:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:13] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:11:49] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:49] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:27:26] SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (RhinosF1) Resolved→Open a: Jclark-ctr→None Hey, analytics1068 has been alerting about mega raid. I'm guessing that's expected. Can someone please downtime it if so?
[11:55:53] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:31] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:14:13] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:24:01] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:29:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:29:40] (CR) Gergő Tisza: [C: -2] "The backport needs to include Ib34a84d1d3abbce4dcf7433b51abf6e694984c59." [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) (owner: Gergő Tisza)
[12:56:03] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:04:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28891 and previous config saved to /var/cache/conftool/dbconfig/20220529-130406-ladsgroup.json
[13:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:18] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[13:12:29] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:16:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:25:09] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:31:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:53] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:18] (ProbeDown) firing: Service zotero:4969 has failed probes (http_zotero_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:06:55] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:07:19] (ProbeDown) firing: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:07] here, checking
[14:09:15] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_4969: Servers kubernetes1008.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:10:06] I forgot we're paging on zotero heh
[14:10:12] godog: I acked the page
[14:10:25] sobanski: cheers!
[14:10:36] Let me know if I can help with anything else
[14:10:54] ack, checking logs
[14:10:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /api (Zotero and citoid alive) is WARNING: Test Zotero and citoid alive responds with unexpected value at path [0]/itemType = webpage https://wikitech.wikimedia.org/wiki/Citoid
[14:11:00] * volans on mobile, not much I can do to help right now
[14:11:11] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_4969: Servers kubernetes1022.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:12:18] (ProbeDown) resolved: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:12:27] mmhh it is back
[14:13:00] can't say I understand what happened yet
[14:13:06] volans: thanks for showing up
[14:13:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:13:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:13:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:16:18] (ProbeDown) resolved: Service zotero:4969 has failed probes (http_zotero_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:15] still looking at k8s logs on logstash
[14:19:10] looks like zotero tried and failed to parse some documents, I'm guessing there was enough load to make all containers unavailable
[14:19:48] some zotero pods got depooled by kubernetes because readiness probes failed. pods are still being throttled quite a lot
[14:21:26] *nod* I'm looking at the logs brutally and literally searching for "error" but it doesn't seem a very new thing, similar errors occurred in the past https://logstash.wikimedia.org/goto/936f704b43ac50b4bdfad95f58df5797
[14:21:29] but all pods are ready now and readiness probes are happy
[14:22:52] I also got the error about parsing css in kubectl get logs
[14:23:07] cpu/mem does seem elevated from the dashboard here https://grafana.wikimedia.org/d/2oPtfvXWk/zotero?orgId=1&from=now-24h&to=now&refresh=1m
[14:23:50] though we had a similar spike a few hours ago, perhaps not big enough to trigger unavailability
[14:27:53] jelto: the only thing I can think of for now is to bump limit/quota a little in case this traffic and/or heavy parsing comes back, what do you think? or possibly even ride it out, not sure there's a whole lot we can do
[14:27:56] http 5XX are back to normal but were elevated during the page .. so it seems zotero is just answering slower
[14:28:09] when RSS explodes that is
[14:29:05] I assume we could scale from 14 replicas higher.. but this could cascade the problem somewhere else. I would say try to ride it out, as there is no elevated 5XX?
[14:29:43] btw I'm looking at envoy metrics here https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=zotero&var-destination=All&from=now-6h&to=now
[14:30:09] ack, thanks for the dashboard
[14:31:50] yeah +1 to ride it out, if it comes back we'll take other measures
[14:32:18] * godog back afk
[14:33:32] throttling seems to decrease .. if it comes back we could scale the zotero-production deployment in the eqiad kubernetes cluster
[14:44:52] zotero usage back to normal and no more throttling.. I'm going afk again
[14:48:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:48:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28892 and previous config saved to /var/cache/conftool/dbconfig/20220529-144839-ladsgroup.json
[14:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:47] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[14:54:23] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:33] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:10:44] !log cleanup stalled backups on gitlab1001, re-run full backup
[15:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:51:12] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:01:34] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:04:27] <_joe_> zotero shouldn't be paging
[16:15:58] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:06] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:48] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:20:24] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:35] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:37:09] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[16:37:39] that doesn't look good?
[16:39:05] _joe_ you around? ^
[16:43:53] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
[16:54:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:55:38] (PS1) Majavah: P:openstack::designate: set base_url to use the https port [puppet] - https://gerrit.wikimedia.org/r/800948 (https://phabricator.wikimedia.org/T267194)
[16:55:40] (PS1) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[16:55:42] (PS1) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[16:55:44] (PS1) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[16:55:46] (PS1) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[16:55:48] (PS1) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[16:55:50] (PS1) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[16:55:52] (PS1) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[17:14:34] (PS2) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[17:14:37] (PS2) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[17:14:39] (PS2) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[17:14:41] (PS2) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[17:14:43] (PS2) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[17:14:45] (PS2) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[17:14:47] (PS2) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[17:17:17] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:17:39] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35612/console" [puppet] - https://gerrit.wikimedia.org/r/800951 (owner: Majavah)
[17:18:45] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:32:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:43:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28893 and previous config saved to /var/cache/conftool/dbconfig/20220529-184614-ladsgroup.json
[18:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:23] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[18:55:37] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:39] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28894 and previous config saved to /var/cache/conftool/dbconfig/20220529-190119-ladsgroup.json
[19:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:16:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28895 and previous config saved to /var/cache/conftool/dbconfig/20220529-191625-ladsgroup.json
[19:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:51] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:21:53] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28896 and previous config saved to /var/cache/conftool/dbconfig/20220529-193130-ladsgroup.json
[19:31:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:31:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:36] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[19:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28897 and previous config saved to /var/cache/conftool/dbconfig/20220529-193138-ladsgroup.json
[19:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:48] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:14:13] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:21] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:15:29] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:22:19] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:31] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:54:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28898 and previous config saved to /var/cache/conftool/dbconfig/20220529-215425-ladsgroup.json [21:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:33] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [21:57:11] (03PS1) 10Ladsgroup: Stop trying to pass legacy page_restrictions to 
RestrictionStore [extensions/LiquidThreads] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800705 (https://phabricator.wikimedia.org/T309460) [22:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28899 and previous config saved to /var/cache/conftool/dbconfig/20220529-220930-ladsgroup.json [22:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:23:35] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28900 and previous config saved to /var/cache/conftool/dbconfig/20220529-222435-ladsgroup.json [22:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:41] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28901 and previous config saved to /var/cache/conftool/dbconfig/20220529-223940-ladsgroup.json [22:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:48] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298560 [22:44:29] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:09:06] not sure which channel is appropriate for wikitech sysop requests, but could use some help there with that vandal [23:27:09] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
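The PROBLEM/RECOVERY bot lines above follow one fixed shape: `[HH:MM:SS] STATE - CHECK on HOST is STATUS: DETAIL …runbook URLs`. A minimal sketch of pulling structured fields out of such lines when skimming a log like this; the names `ALERT_RE` and `parse_alert` are illustrative only, not part of any WMF tooling, and everything after the status is treated as free-form detail.

```python
import re

# One named group per field of the icinga-wm alert line shape seen above,
# e.g. "[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: ...".
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<state>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
    r"(?P<status>OK|WARNING|CRITICAL|UNKNOWN): (?P<detail>.*)"
)

def parse_alert(line: str):
    """Return a dict of alert fields, or None for chat/!log/gerrit lines."""
    m = ALERT_RE.match(line)
    return m.groupdict() if m else None

line = ("[19:10:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: "
        "CRITICAL: 13 LD(s) must have write cache policy WriteBack")
alert = parse_alert(line)
```

Human chat and `!log` lines deliberately fall through to `None`, so the same loop can filter a mixed log down to monitoring alerts only.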