[00:02:31] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:30] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: improve incremental backup logic [puppet] - https://gerrit.wikimedia.org/r/829250 [01:02:32] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: format with black [puppet] - https://gerrit.wikimedia.org/r/829251 [01:05:59] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:14] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: improve incremental backup logic [puppet] - https://gerrit.wikimedia.org/r/829250 (owner: Andrew Bogott) [01:08:25] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: format with black [puppet] - https://gerrit.wikimedia.org/r/829251 (owner: Andrew Bogott) [01:30:15] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:05] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T314041)', diff saved to https://phabricator.wikimedia.org/P33746 and previous config saved to /var/cache/conftool/dbconfig/20220903-015502-ladsgroup.json [01:55:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [01:55:08] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:55:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [01:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T314041)', diff saved to https://phabricator.wikimedia.org/P33747 and previous config saved to /var/cache/conftool/dbconfig/20220903-015524-ladsgroup.json [02:00:47] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:01] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:53] (PS1) Andrew Bogott: cinder: set backup_use_same_host=True [puppet] - https://gerrit.wikimedia.org/r/829253 (https://phabricator.wikimedia.org/T294429) [02:42:01] (PS1) Andrew Bogott: Switch on a second cinder backup host for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/829256 (https://phabricator.wikimedia.org/T294429) [02:44:36] (CR) Andrew Bogott: [C: +2] cinder: set backup_use_same_host=True [puppet] - https://gerrit.wikimedia.org/r/829253 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott) [02:44:59] (CR) Andrew Bogott: [C: +2] Switch on a second cinder backup host for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/829256 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott) [03:23:20] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:01:55] (PS1) Andrew Bogott: Revert "cinder: set backup_use_same_host=True" [puppet] - https://gerrit.wikimedia.org/r/829145 [04:03:14] (CR) Andrew Bogott: [C: +2] Revert "cinder: set backup_use_same_host=True" [puppet] - https://gerrit.wikimedia.org/r/829145 (owner: Andrew Bogott) [04:39:30] PROBLEM - Host 
ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [04:40:10] PROBLEM - Host mw2335.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:40:10] PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:40:10] PROBLEM - Host mw2337.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:40:10] PROBLEM - Host mw2338.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:40:10] PROBLEM - Host mw2339.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:40:42] PROBLEM - Host sessionstore2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:00] PROBLEM - Host mw2416.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:10] PROBLEM - Host mw2413.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:10] PROBLEM - Host mw2412.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:26] PROBLEM - Host db2113.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:28] PROBLEM - Host db2141.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:41:28] PROBLEM - Host db2144.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:42:02] PROBLEM - Host phab2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:42:08] PROBLEM - Host prometheus2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:42:10] PROBLEM - Host restbase-dev2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:42:22] PROBLEM - Host rdb2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:06] PROBLEM - Host conf2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:30] PROBLEM - Host db2150.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:34] PROBLEM - Host ml-serve2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:38] PROBLEM - Host mw2414.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:40] PROBLEM - Host mw2415.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:40] PROBLEM - Host mw2417.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:40] PROBLEM - Host mw2418.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:43:56] 
PROBLEM - Host db2169.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:44:06] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.89 ms [04:45:40] RECOVERY - Host db2169.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [04:46:40] RECOVERY - Host mw2335.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [04:46:40] RECOVERY - Host mw2336.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [04:46:40] RECOVERY - Host mw2337.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [04:46:40] RECOVERY - Host mw2338.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [04:46:40] RECOVERY - Host mw2339.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [04:47:12] RECOVERY - Host sessionstore2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [04:47:24] RECOVERY - Host mw2418.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [04:47:30] RECOVERY - Host mw2416.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms [04:47:40] RECOVERY - Host mw2413.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [04:47:40] RECOVERY - Host mw2412.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [04:47:56] RECOVERY - Host db2113.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [04:47:58] RECOVERY - Host db2144.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.87 ms [04:47:58] RECOVERY - Host db2141.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.94 ms [04:48:34] RECOVERY - Host phab2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [04:48:40] RECOVERY - Host prometheus2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [04:48:42] RECOVERY - Host restbase-dev2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [04:48:54] RECOVERY - Host rdb2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [04:49:38] RECOVERY - Host conf2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [04:50:02] RECOVERY - Host db2150.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [04:50:06] RECOVERY - 
Host ml-serve2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [04:50:10] RECOVERY - Host mw2414.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [04:50:12] RECOVERY - Host mw2415.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [04:50:12] RECOVERY - Host mw2417.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [05:01:20] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:31:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:36:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220903T0700) [07:07:52] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [07:40:36] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:23:23] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 64.04 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [08:47:56] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 2.016 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [08:51:14] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:52:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:00:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 
0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:00:52] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:38] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:29:38] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:24] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [09:45:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [09:48:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:59:56] RECOVERY - Router interfaces on cr4-ulsfo is 
OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:01:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:54] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:55:13] (PS1) Legoktm: Update legoktm's root SSH key [labs/private] - https://gerrit.wikimedia.org/r/829262 [11:05:44] PROBLEM - Query Service HTTP Port on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:08:04] RECOVERY - Query Service HTTP Port on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:03:46] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:03:12] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:32] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have 
write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:58:24] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:03:28] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (Aklapper) @Dzahn: Not in the special case of `ldap/wmf` though, [per SRE instructions](https://wikitech.wikimedia.org/wik... [14:07:36] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:22] PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:27:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:15] (PS1) Majavah: wikitech: drop webserver_hostname_aliases [puppet] - https://gerrit.wikimedia.org/r/829287 [14:40:42] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37106/console" [puppet] - 
https://gerrit.wikimedia.org/r/829287 (owner: Majavah) [15:03:22] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:02] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:12:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:12:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:12:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T312863)', diff saved to https://phabricator.wikimedia.org/P33748 and previous config saved to /var/cache/conftool/dbconfig/20220903-151224-ladsgroup.json [15:12:31] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:18:09] (PS1) Majavah: keepalived: do not hardcode default interface name [puppet] - https://gerrit.wikimedia.org/r/829288 [15:18:11] (PS1) Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:52:56] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring 
[16:27:24] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:08:02] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:42] RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T314041)', diff saved to https://phabricator.wikimedia.org/P33749 and previous config saved to /var/cache/conftool/dbconfig/20220903-180042-ladsgroup.json [18:00:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:00:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:00:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T314041)', diff saved to https://phabricator.wikimedia.org/P33750 and previous config saved to /var/cache/conftool/dbconfig/20220903-180104-ladsgroup.json [18:09:22] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:13:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:19:46] (CR) 
Majavah: P:toolforge: remove linux kernel pinnings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: Majavah) [19:19:16] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:22] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:21:46] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:21:52] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:31:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:31:48] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:16] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:12] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:46:38] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:46:44] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:48:12] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:50:42] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:55:56] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:22:12] PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:02:24] (CR) Krinkle: [C: +1] webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff) [21:02:30] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T312863)', diff saved to https://phabricator.wikimedia.org/P33751 and previous config saved to /var/cache/conftool/dbconfig/20220903-211808-ladsgroup.json [21:18:14] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:23:34] RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to 
https://phabricator.wikimedia.org/P33752 and previous config saved to /var/cache/conftool/dbconfig/20220903-213314-ladsgroup.json [21:36:58] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:48:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P33753 and previous config saved to /var/cache/conftool/dbconfig/20220903-214820-ladsgroup.json [22:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T312863)', diff saved to https://phabricator.wikimedia.org/P33754 and previous config saved to /var/cache/conftool/dbconfig/20220903-220326-ladsgroup.json [22:03:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [22:03:32] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [22:03:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [22:03:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [22:04:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [22:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33755 and previous config saved to 
/var/cache/conftool/dbconfig/20220903-220427-ladsgroup.json [22:15:16] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:39:32] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:41:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 16 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [23:01:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33756 and previous config saved to /var/cache/conftool/dbconfig/20220903-230443-ladsgroup.json [23:04:48] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [23:19:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33757 and previous config saved to /var/cache/conftool/dbconfig/20220903-231949-ladsgroup.json [23:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33758 and previous config saved to /var/cache/conftool/dbconfig/20220903-233455-ladsgroup.json [23:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff 
saved to https://phabricator.wikimedia.org/P33759 and previous config saved to /var/cache/conftool/dbconfig/20220903-235001-ladsgroup.json [23:50:07] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863