[00:11:36] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert
[00:38:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925
[00:38:52] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925 (owner: TrainBranchBot)
[00:54:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925 (owner: TrainBranchBot)
[02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:18] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:04] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:04] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:52] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:41] SRE, Research, Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (fkaelin)
[03:58:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[05:39:44] (PS13) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[05:43:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[06:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:17:21] (PS1) KartikMistry: WIP: cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137)
[08:28:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[09:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:36] SRE, Research, Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (Ladsgroup) Open→Resolved a:Ladsgroup Done You might need to add some exceptions to https://lists.wikimedia.org/postorius/lists/research-engineering-alerts.lists.wi...
[10:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:32:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:38:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[13:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:47] PROBLEM - Host db1130 #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:11:45] can someone repool?
[16:11:48] depool
[16:11:56] * Emperor here
[16:12:16] (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:12:49] I'm way afk. Please depool.
[16:12:53] PROBLEM - MariaDB Replica IO: s5 #page on db1185 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:55] PROBLEM - MariaDB Replica IO: s5 #page on db1183 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:56] PROBLEM - MariaDB Replica IO: s5 #page on db1161 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:57] PROBLEM - MariaDB Replica IO: s5 #page on db1210 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:58] RECOVERY - Host db1130 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:13:02] dbctl instance db 1130 depool
[16:13:09] hold on a mo, on it
[16:13:10] from cumin
[16:13:18] marostegui: isn't it master?
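For context on the "dbctl instance db 1130 depool ... from cumin" exchange just above: depooling a database instance is a two-step conftool/dbctl operation run from a cumin host, first marking the instance depooled and then committing the new config. A minimal sketch follows; the `config commit -m` spelling and the commit message are assumptions based on the SAL entries later in this log, and as the next messages show a plain depool was not applicable here, because db1130 turned out to be the s5 primary.

```
# Sketch of the standard replica-depool flow from a cumin host (cumin1001 is
# the host seen in the later SAL entries). Not applicable to a primary/master.
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130: host crashed"   # message illustrative
```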
[16:13:23] <_joe_> it's a master
[16:13:24] PROBLEM - MariaDB Replica IO: s5 #page on db1200 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:27] It might be s5 master
[16:13:27] shit
[16:13:28] PROBLEM - MariaDB Replica IO: s5 #page on db1213 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:30] Shir
[16:13:34] * Emperor holding off
[16:13:46] <_joe_> the server is up
[16:13:53] PROBLEM - MariaDB Replica IO: s5 on dbstore1003 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:53] PROBLEM - MariaDB Replica IO: s5 on db2113 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:59] It takes me half an hour at least to get home
[16:14:00] orchestrator thinks db1130 is master
[16:14:02] PROBLEM - MariaDB Replica IO: s5 #page on db1144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:07] PROBLEM - MariaDB Replica IO: s5 on db1216 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:19] PROBLEM - MariaDB Replica IO: s5 on db1145 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:19] _joe_: what's slave status?
[16:14:20] <_joe_> marostegui: can I just restart mysql?
[16:14:31] <_joe_> slave status afe all down
[16:14:34] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_replica_to_master) says call a DBA if master is sad
[16:15:03] <_joe_> can I just restart mariadb?
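The "what's slave status?" / "orchestrator thinks db1130 is master" exchange above is about confirming the host's role before acting on it. A sketch of those checks using standard MariaDB statements and the `db-mysql` wrapper that marostegui points _joe_ to at 16:17:25 (assuming the wrapper passes statements on stdin through to the server like the plain mysql client does):

```
# From a cumin host: open a client session on the affected host and check
# replication and read-only state before deciding to restart anything.
sudo db-mysql db1130 <<'SQL'
SHOW SLAVE STATUS\G        -- upstream replication state (Slave_IO_Running etc.)
SELECT @@read_only;        -- a writable primary runs with read_only = 0
SELECT @@hostname, @@version;
SQL
```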
[16:15:05] one sec
[16:15:10] <_joe_> the server rebooted
[16:15:13] Restart mariad but don't forget to run start slave
[16:15:19] Here if needed
[16:15:23] I'll ack the p.age
[16:15:24] <_joe_> start slave on a master?
[16:15:47] Sorry stupid thing
[16:15:52] Set read only to off
[16:15:53] <_joe_> !log systemctl start mariadb.service on db1130
[16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:06] <_joe_> if mysql starts that is
[16:16:10] PROBLEM - MariaDB read only s5 #page on db1130 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:16:16] Once restarted. It's read only by default
[16:16:25] sorry I was driving
[16:16:27] just stopped
[16:16:32] can someone check the logs?
[16:16:33] <_joe_> ok, I don't even remember how to get into mysql
[16:16:36] please do a few selects
[16:16:46] I'm also on phone without access anytime soon
[16:16:53] RECOVERY - MariaDB Replica IO: s5 on dbstore1003 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:16:53] I'm home in 10 minutes
[16:16:55] RECOVERY - MariaDB Replica IO: s5 on db2113 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:02] RECOVERY - MariaDB Replica IO: s5 #page on db1144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:09] RECOVERY - MariaDB Replica IO: s5 on db1216 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:09] _joe_: can I call you?
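The plan that emerges above, restart MariaDB and then deal with read-only, matches the runbook linked from the "MariaDB read only s5" page. A sketch of those two steps as reflected in the !log entries; the verification query and the exact client invocation are illustrative:

```
# On db1130, after the crash/reboot (as !logged at 16:15:53):
sudo systemctl start mariadb.service

# A restarted instance comes back with read_only=1 by design, which is what
# triggers the "MariaDB read only s5" page; check it from a cumin host:
echo 'SELECT @@read_only;' | sudo db-mysql db1130
```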
[16:17:10] <_joe_> the server is up and running
[16:17:14] <_joe_> marostegui: sure
[16:17:16] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:17:19] RECOVERY - MariaDB Replica IO: s5 on db1145 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:25] _joe_: in cumin: sudo db-mysql db1130
[16:17:30] RECOVERY - MariaDB Replica IO: s5 #page on db1185 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:32] RECOVERY - MariaDB Replica IO: s5 #page on db1183 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:32] RECOVERY - MariaDB Replica IO: s5 #page on db1161 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:36] RECOVERY - MariaDB Replica IO: s5 #page on db1210 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:02] RECOVERY - MariaDB Replica IO: s5 #page on db1200 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:06] RECOVERY - MariaDB Replica IO: s5 #page on db1213 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:26] nothing interesting in kernel log
[16:18:31] Do a couple of selects
[16:18:40] 123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err | Memory | Uncorrectable memory error
[16:18:48] That would do it
[16:18:49] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:05] Like "select max(rev_id) from revision" on dewiki
[16:19:21] Restart heartbeat please
[16:19:24] ^
[16:19:28] on it
[16:19:46] done and started
[16:19:46] <_joe_> !log set read_only=0 on db1130
[16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:59] <_joe_> claime: did you restart heartbeat
[16:20:03] _joe_: just did
[16:20:19] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:31] _joe_: shouldn't i have?
[16:20:37] This is to bring back functionality. Now we need to check for corruptions :/
[16:20:48] RECOVERY - MariaDB read only s5 #page on db1130 is OK: Version 10.4.26-MariaDB-log, Uptime 308s, read_only: False, event_scheduler: True, 1374.95 QPS, connection latency: 0.004782s, query latency: 0.000239s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:21:10] Amir1: corruption is possible, since the trigger for the reboot seems to be an ECC uncorrectable error
[16:21:11] Amir1: max(rev_id) on dewiki is 235930044 ;
[16:21:31] (not sure what to do with that information, other than it tells us the database started up vaguely OK)
[16:21:33] We need to switchover ASAP
[16:22:06] Amir1: do you want us to attempt that before m.arostegui gets to a computer?
[16:22:10] yeah, I'm home in 15 minutes and I will do the switch
[16:22:10] <_joe_> replication restarted fine
[16:22:12] <_joe_> and no
[16:22:16] I'll be home in half an hour. Maybe Manuel
[16:22:23] <_joe_> marostegui said he'll take a further look when he's home
[16:22:27] don't worry I'll get to it
[16:22:29] Awesome
[16:22:31] <_joe_> right now the wikis work, replication is working fine
[16:22:33] can someone create a quick task?
[16:22:43] to track the crash
[16:22:45] Thanks marostegui 💙💙💙
[16:22:57] I'll make a phab task to track the crash
[16:23:14] should we update wikimediastatus?
[16:23:21] thanks Emperor
[16:23:46] claime: the impact is read only for dewiki and a bunch of wikis. Not too major tbh
[16:24:10] I disagree a bit with "not too major" :)
[16:24:16] spike in errors
[16:24:22] it wasn't too many minutes that's for sure
[16:24:28] I'd at least post that we had a db event and we're monitoring
[16:24:40] I'd like to have clear tiers for incidents so we know it for sure but that's for later
[16:24:41] <_joe_> I can edit on dewiki just fine
[16:25:12] Yeah. It was a couple of minutes
[16:25:14] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (MatthewVernon)
[16:26:06] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Marostegui) a:Marostegui The initial issue was triaged. I'll be home in 10 minutes and will replace the master
[16:26:07] I've not put many details on that, but I think it's enough to capture what needs looking at
[16:26:21] ok I'll defer to you guys, not updating status
[16:26:21] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Marostegui) p:Triage→High
[16:26:30] yeah that's good Emperor thank you!
[16:26:38] NP. You need anything else doing?
[16:26:40] Thanks
[16:26:56] nope, you can go back to the UK fantastic weather Emperor
[16:27:16] * Emperor got rained on earlier :)
[16:27:30] xD
[16:27:39] Added the ipmi-sel with date and uptime for time correlation
[16:27:43] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Clement_Goubert) ` cgoubert@db1130:~$ sudo ipmi-sel | grep Jul-29; date; uptime 123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err | Memory | Uncorrectable memory error Sat 29 Jul 2023 04:27:01 PM UTC 16:27:01 up 14 min...
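The recovery steps carried out in the 16:18–16:21 window above, gathered into one sketch. The commands follow the !log entries, the pt-heartbeat unit name from the systemd alert, and Clement_Goubert's ipmi-sel paste; the exact client invocations are illustrative rather than a transcript of what was typed:

```
# 1. Hardware cause: read the BMC event log (as in the Phabricator paste).
sudo ipmi-sel | grep Jul-29          # shows the ECC uncorrectable memory error

# 2. pt-heartbeat died with the crash, hence the degraded-systemd alert:
sudo systemctl restart pt-heartbeat-wikimedia.service

# 3. A restarted master comes back read-only; flip it once the host is trusted
#    again (what _joe_ !logged at 16:19:46), then run a sanity select:
sudo db-mysql db1130 <<'SQL'
SET GLOBAL read_only = 0;
SELECT MAX(rev_id) FROM dewiki.revision;   -- returned 235930044 at 16:21:11
SQL
```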
[16:29:25] (PS1) Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942786 (https://phabricator.wikimedia.org/T343077)
[16:30:17] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[16:30:59] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[16:36:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T343077
[16:36:07] T343077: Switchover s5 master (db1130 -> db1183) - https://phabricator.wikimedia.org/T343077
[16:36:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T343077
[16:36:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1183 with weight 0 T343077', diff saved to https://phabricator.wikimedia.org/P49798 and previous config saved to /var/cache/conftool/dbconfig/20230729-163621-root.json
[16:42:24] (PS1) Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942787 (https://phabricator.wikimedia.org/T343078)
[16:46:40] (PS1) Marostegui: wmnet: Update s5-master cname [dns] - https://gerrit.wikimedia.org/r/942771 (https://phabricator.wikimedia.org/T343077)
[16:47:46] (CR) Marostegui: [C: +2] mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942786 (https://phabricator.wikimedia.org/T343077) (owner: Gerrit maintenance bot)
[16:50:41] (PS1) Marostegui: db1130: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/942772 (https://phabricator.wikimedia.org/T343077)
[16:57:34] !log Starting emergency s5 eqiad failover from db1130 to db1183 - T343077 T343076
[16:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:40] T343077: Switchover s5 master (db1130 -> db1183) - https://phabricator.wikimedia.org/T343077
[16:57:41] T343076: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076
[16:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Emergency switchover T343077', diff saved to https://phabricator.wikimedia.org/P49799 and previous config saved to /var/cache/conftool/dbconfig/20230729-165748-root.json
[16:58:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1183 to s5 primary T343077', diff saved to https://phabricator.wikimedia.org/P49800 and previous config saved to /var/cache/conftool/dbconfig/20230729-165813-root.json
[16:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 T343077', diff saved to https://phabricator.wikimedia.org/P49801 and previous config saved to /var/cache/conftool/dbconfig/20230729-165954-root.json
[17:00:14] (CR) Marostegui: [C: +2] wmnet: Update s5-master cname [dns] - https://gerrit.wikimedia.org/r/942771 (https://phabricator.wikimedia.org/T343077) (owner: Marostegui)
[17:00:23] (CR) Marostegui: [C: +2] db1130: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/942772 (https://phabricator.wikimedia.org/T343077) (owner: Marostegui)
[17:01:37] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:02:10] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:04:45] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui) db1130 is scheduled for refresh with the HW that is arriving this quarter at T341269
[17:06:43] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui) @Jclark-ctr any chances you've got an old DIMM somewhere to replace this one? ` /admin1/system1/logs1/log1-> show record123 properties CreationTimestamp = 20230707040843.000000-300 ElementN...
[17:07:24] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:52:28] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:50] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:39:08] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:40:26] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[21:00:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:00:58] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:04] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:43:36] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
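For reference, the emergency s5 failover recorded in the 16:36–17:00 entries above follows the standard primary-switchover sequence: downtime the section, prepare the candidate, move the master pointer in dbctl, depool the crashed host, and land the puppet/DNS changes. A condensed sketch; the cookbook arguments and dbctl sub-command spellings are assumptions reconstructed from the SAL messages, not verbatim commands, and the host query is a placeholder:

```
# Run from cumin1001 (per the SAL entries); flag spellings are assumed.

# Downtime the 25 affected hosts for an hour (SAL 16:36:03):
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T343077" '<s5 hosts query>'

# Give the candidate master zero read weight and commit (SAL 16:36:22):
sudo dbctl instance db1183 set-weight 0
sudo dbctl config commit -m "Set db1183 with weight 0 T343077"

# Switch the section master and depool the crashed host (SAL 16:57:49-16:59:55):
sudo dbctl --scope eqiad section s5 set-master db1183
sudo dbctl config commit -m "Promote db1183 to s5 primary T343077"
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130 T343077"

# Finally, merge the prepared puppet and DNS changes:
#   https://gerrit.wikimedia.org/r/942786  (mariadb: Promote db1183 to s5 master)
#   https://gerrit.wikimedia.org/r/942771  (wmnet: Update s5-master cname)
```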