[00:02:29] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[00:18:01] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[00:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[00:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:52:29] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:56:55] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:07:51] (PS1) Andrew Bogott: hwraid-2dev.cfg partman test [puppet] - https://gerrit.wikimedia.org/r/802905
[01:08:17] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[01:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:07] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg partman test [puppet] - https://gerrit.wikimedia.org/r/802905 (owner: Andrew Bogott)
[01:22:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[01:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[01:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:36] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[01:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298560)', diff saved to https://phabricator.wikimedia.org/P29399 and previous config saved to /var/cache/conftool/dbconfig/20220605-020015-ladsgroup.json
[02:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:19] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:15:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29400 and previous config saved to /var/cache/conftool/dbconfig/20220605-021520-ladsgroup.json
[02:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29401 and previous config saved to /var/cache/conftool/dbconfig/20220605-023025-ladsgroup.json
[02:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:33] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:37:51] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 9 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:45:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298560)', diff saved to https://phabricator.wikimedia.org/P29402 and previous config saved to /var/cache/conftool/dbconfig/20220605-024530-ladsgroup.json
[02:45:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[02:45:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[02:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:36] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298560)', diff saved to https://phabricator.wikimedia.org/P29403 and previous config saved to /var/cache/conftool/dbconfig/20220605-024538-ladsgroup.json
[02:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[02:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[03:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[03:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:36:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[03:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[03:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:02:58] (PS1) Andrew Bogott: Revert "hwraid-2dev.cfg partman test" [puppet] - https://gerrit.wikimedia.org/r/802907
[04:03:00] (PS1) Andrew Bogott: Revert "hwraid-2dev.cfg: Try to get grub onto the boot partition" [puppet] - https://gerrit.wikimedia.org/r/802908
[04:05:15] (CR) Andrew Bogott: [C: +2] Revert "hwraid-2dev.cfg: Try to get grub onto the boot partition" [puppet] - https://gerrit.wikimedia.org/r/802908 (owner: Andrew Bogott)
[04:05:22] (CR) Andrew Bogott: [C: +2] Revert "hwraid-2dev.cfg partman test" [puppet] - https://gerrit.wikimedia.org/r/802907 (owner: Andrew Bogott)
[04:22:37] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:23:33] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:27:13] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 451 probes of 673 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:29:55] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 124 probes of 673 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:58:31] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:46:57] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.38 ms
[05:47:51] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 238.98 ms
[05:49:13] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 62 probes of 673 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:51:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 61 probes of 673 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:55:05] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220605T0700)
[07:03:51] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 51.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:04:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:06:09] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 100.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:06:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 103.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:37:05] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:38:17] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:03:43] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:04:58] hey Amir1, you are wise in the ways of databases, T309939 has piqued my interest — any ideas?
[10:04:59] T309939: Wikimedia\Rdbms\DBQueryError from line 1700 of Database.php: Error 1366: Incorrect string value - https://phabricator.wikimedia.org/T309939 [10:05:46] (tl;dr `Error 1366: Incorrect string value: '\xC5\x91'` - sounds like an utf8mb3/utf8mb4 issue to me, but I have next to no experience with this) [11:01:17] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298560)', diff saved to https://phabricator.wikimedia.org/P29404 and previous config saved to /var/cache/conftool/dbconfig/20220605-114139-ladsgroup.json [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:43] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [11:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29405 and previous config saved to /var/cache/conftool/dbconfig/20220605-115644-ladsgroup.json [11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298560)', diff saved to https://phabricator.wikimedia.org/P29406 and previous config saved to /var/cache/conftool/dbconfig/20220605-120747-ladsgroup.json [12:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:52] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29407 and previous config saved to /var/cache/conftool/dbconfig/20220605-121149-ladsgroup.json [12:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29408 and previous config saved to /var/cache/conftool/dbconfig/20220605-122252-ladsgroup.json [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298560)', diff saved to https://phabricator.wikimedia.org/P29409 and previous config saved to /var/cache/conftool/dbconfig/20220605-122654-ladsgroup.json [12:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:26:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:59] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:27:00] Logged the message at 
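For context on the exchange above, here is a minimal sketch of how this class of MariaDB/MySQL error can be reproduced and narrowed down. It assumes strict sql_mode and a column whose character set cannot represent the bytes C5 91 (UTF-8 for U+0151, 'ő'); the table and column names are hypothetical and are not taken from the actual schema discussed in T309939.

    -- Hypothetical reproduction: a latin1 column receiving UTF-8 text over a utf8mb4 connection.
    CREATE TABLE charset_demo (val VARCHAR(32) CHARACTER SET latin1);
    SET NAMES utf8mb4;
    INSERT INTO charset_demo VALUES ('ő');
    -- Under strict sql_mode this fails with something like:
    --   ERROR 1366: Incorrect string value: '\xC5\x91' for column 'val' at row 1

    -- Checking which character set the affected column actually uses
    -- (latin1 vs utf8mb3 vs utf8mb4 vs binary):
    SELECT table_name, column_name, character_set_name, collation_name
    FROM information_schema.COLUMNS
    WHERE table_schema = DATABASE() AND column_name = 'val';

Whether the real cause on the task is a utf8mb3/utf8mb4 mismatch, as guessed above, or some other charset combination would have to be confirmed against the actual table definition.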
[11:01:17] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298560)', diff saved to https://phabricator.wikimedia.org/P29404 and previous config saved to /var/cache/conftool/dbconfig/20220605-114139-ladsgroup.json
[11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:43] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[11:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29405 and previous config saved to /var/cache/conftool/dbconfig/20220605-115644-ladsgroup.json
[11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298560)', diff saved to https://phabricator.wikimedia.org/P29406 and previous config saved to /var/cache/conftool/dbconfig/20220605-120747-ladsgroup.json
[12:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:52] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29407 and previous config saved to /var/cache/conftool/dbconfig/20220605-121149-ladsgroup.json
[12:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29408 and previous config saved to /var/cache/conftool/dbconfig/20220605-122252-ladsgroup.json
[12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298560)', diff saved to https://phabricator.wikimedia.org/P29409 and previous config saved to /var/cache/conftool/dbconfig/20220605-122654-ladsgroup.json
[12:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:26:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:59] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298560)', diff saved to https://phabricator.wikimedia.org/P29410 and previous config saved to /var/cache/conftool/dbconfig/20220605-122702-ladsgroup.json
[12:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29411 and previous config saved to /var/cache/conftool/dbconfig/20220605-123757-ladsgroup.json
[12:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:31] (PS1) Stang: Revert "votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election" [mediawiki-config] - https://gerrit.wikimedia.org/r/802833
[12:51:17] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 55536 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[12:53:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298560)', diff saved to https://phabricator.wikimedia.org/P29412 and previous config saved to /var/cache/conftool/dbconfig/20220605-125302-ladsgroup.json
[12:53:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[12:53:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[12:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:08] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:27] TheresNoTime: hey, I'm off for the next three weeks :/ I suggest talking to Manuel about it tomorrow
[13:44:00] (PS1) Andrew Bogott: Openstack nova vendordata: increase metadata timeouts [puppet] - https://gerrit.wikimedia.org/r/802914 (https://phabricator.wikimedia.org/T309930)
[13:48:07] (CR) Andrew Bogott: [C: +2] Openstack nova vendordata: increase metadata timeouts [puppet] - https://gerrit.wikimedia.org/r/802914 (https://phabricator.wikimedia.org/T309930) (owner: Andrew Bogott)
[14:35:47] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:06:01] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:35] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:18:11] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:36:59] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:01:33] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:47] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:48:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:48:51] SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (Xover) @BBlack The last status update on this bug was ~18 months a...
[17:20:35] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:15:03] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:50:53] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:15:33] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:04:56] (PS1) Mitar: Add page metadata to Wikibase JSON dumps [puppet] - https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104)
[20:17:25] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:24:15] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:25:25] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:35:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298560)', diff saved to https://phabricator.wikimedia.org/P29413 and previous config saved to /var/cache/conftool/dbconfig/20220605-213547-ladsgroup.json
[21:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:52] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[21:50:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29414 and previous config saved to /var/cache/conftool/dbconfig/20220605-215052-ladsgroup.json
[21:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29415 and previous config saved to /var/cache/conftool/dbconfig/20220605-220557-ladsgroup.json
[22:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:19] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298560)', diff saved to https://phabricator.wikimedia.org/P29416 and previous config saved to /var/cache/conftool/dbconfig/20220605-222102-ladsgroup.json
[22:21:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[22:21:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[22:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:08] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[22:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29417 and previous config saved to /var/cache/conftool/dbconfig/20220605-222110-ladsgroup.json
[22:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook