[00:11:36] (Nonwrite HTTP requests with primary DB writes alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+writes+alert
[00:38:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925
[00:38:52] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925 (owner: TrainBranchBot)
[00:54:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/941925 (owner: TrainBranchBot)
[02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:18] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:04] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:04] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:52] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:41] SRE, Research, Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (fkaelin)
[03:58:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[05:39:44] (PS13) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[05:43:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[06:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:17:21] (PS1) KartikMistry: WIP: cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137)
[08:28:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[09:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:36] SRE, Research, Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (Ladsgroup) Open→Resolved a:Ladsgroup Done You might need to add some exceptions to https://lists.wikimedia.org/postorius/lists/research-engineering-alerts.lists.wi...
[10:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:32:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:38:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[13:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:47] PROBLEM - Host db1130 #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:11:45] can someone repool?
[16:11:48] depool
[16:11:56] * Emperor here
[16:12:16] (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:12:49] I'm way afk. Please depool.
[16:12:53] PROBLEM - MariaDB Replica IO: s5 #page on db1185 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:55] PROBLEM - MariaDB Replica IO: s5 #page on db1183 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:56] PROBLEM - MariaDB Replica IO: s5 #page on db1161 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:57] PROBLEM - MariaDB Replica IO: s5 #page on db1210 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:58] RECOVERY - Host db1130 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:13:02] dbctl instance db 1130 depool
[16:13:09] hold on a mo, on it
[16:13:10] from cumin
[16:13:18] marostegui: isn't it master?
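For context on the "dbctl instance db 1130 depool ... from cumin" exchange just above: depooling a database instance is a two-step conftool/dbctl operation run from a cumin host, first marking the instance depooled and then committing the new config. A minimal sketch follows; the `config commit -m` spelling and the commit message are assumptions based on the SAL entries later in this log, and as the next messages show a plain depool was not applicable here, because db1130 turned out to be the s5 primary.

```
# Sketch of the standard replica-depool flow from a cumin host (cumin1001 is
# the host seen in the later SAL entries). Not applicable to a primary/master.
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130: host crashed"   # message illustrative
```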
[16:13:23] <_joe_> it's a master
[16:13:24] PROBLEM - MariaDB Replica IO: s5 #page on db1200 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:27] It might be s5 master
[16:13:27] shit
[16:13:28] PROBLEM - MariaDB Replica IO: s5 #page on db1213 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:30] Shir
[16:13:34] * Emperor holding off
[16:13:46] <_joe_> the server is up
[16:13:53] PROBLEM - MariaDB Replica IO: s5 on dbstore1003 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:53] PROBLEM - MariaDB Replica IO: s5 on db2113 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:13:59] It takes me half an hour at least to get home
[16:14:00] orchestrator thinks db1130 is master
[16:14:02] PROBLEM - MariaDB Replica IO: s5 #page on db1144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:07] PROBLEM - MariaDB Replica IO: s5 on db1216 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:19] PROBLEM - MariaDB Replica IO: s5 on db1145 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1130.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:19] _joe_: what's slave status?
[16:14:20] <_joe_> marostegui: can I just restart mysql?
[16:14:31] <_joe_> slave status afe all down
[16:14:34] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_replica_to_master) says call a DBA if master is sad
[16:15:03] <_joe_> can I just restart mariadb?
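The "what's slave status?" / "orchestrator thinks db1130 is master" exchange above is about confirming the host's role before acting on it. A sketch of those checks using standard MariaDB statements and the `db-mysql` wrapper that marostegui points _joe_ to at 16:17:25 (assuming the wrapper passes statements on stdin through to the server like the plain mysql client does):

```
# From a cumin host: open a client session on the affected host and check
# replication and read-only state before deciding to restart anything.
sudo db-mysql db1130 <<'SQL'
SHOW SLAVE STATUS\G        -- upstream replication state (Slave_IO_Running etc.)
SELECT @@read_only;        -- a writable primary runs with read_only = 0
SELECT @@hostname, @@version;
SQL
```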
[16:15:05] one sec
[16:15:10] <_joe_> the server rebooted
[16:15:13] Restart mariad but don't forget to run start slave
[16:15:19] Here if needed
[16:15:23] I'll ack the p.age
[16:15:24] <_joe_> start slave on a master?
[16:15:47] Sorry stupid thing
[16:15:52] Set read only to off
[16:15:53] <_joe_> !log systemctl start mariadb.service on db1130
[16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:06] <_joe_> if mysql starts that is
[16:16:10] PROBLEM - MariaDB read only s5 #page on db1130 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:16:16] Once restarted. It's read only by default
[16:16:25] sorry I was driving
[16:16:27] just stopped
[16:16:32] can someone check the logs?
[16:16:33] <_joe_> ok, I don't even remember how to get into mysql
[16:16:36] please do a few selects
[16:16:46] I'm also on phone without access anytime soon
[16:16:53] RECOVERY - MariaDB Replica IO: s5 on dbstore1003 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:16:53] I'm home in 10 minutes
[16:16:55] RECOVERY - MariaDB Replica IO: s5 on db2113 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:02] RECOVERY - MariaDB Replica IO: s5 #page on db1144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:09] RECOVERY - MariaDB Replica IO: s5 on db1216 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:09] _joe_: can I call you?
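The plan that emerges above, restart MariaDB and then deal with read-only, matches the runbook linked from the "MariaDB read only s5" page. A sketch of those two steps as reflected in the !log entries; the verification query and the exact client invocation are illustrative:

```
# On db1130, after the crash/reboot (as !logged at 16:15:53):
sudo systemctl start mariadb.service

# A restarted instance comes back with read_only=1 by design, which is what
# triggers the "MariaDB read only s5" page; check it from a cumin host:
echo 'SELECT @@read_only;' | sudo db-mysql db1130
```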
[16:17:10] <_joe_> the server is up and running
[16:17:14] <_joe_> marostegui: sure
[16:17:16] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:17:19] RECOVERY - MariaDB Replica IO: s5 on db1145 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:25] _joe_: in cumin: sudo db-mysql db1130
[16:17:30] RECOVERY - MariaDB Replica IO: s5 #page on db1185 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:32] RECOVERY - MariaDB Replica IO: s5 #page on db1183 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:32] RECOVERY - MariaDB Replica IO: s5 #page on db1161 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:17:36] RECOVERY - MariaDB Replica IO: s5 #page on db1210 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:02] RECOVERY - MariaDB Replica IO: s5 #page on db1200 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:06] RECOVERY - MariaDB Replica IO: s5 #page on db1213 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:18:26] nothing interesting in kernel log
[16:18:31] Do a couple of selects
[16:18:40] 123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err | Memory | Uncorrectable memory error
[16:18:48] That would do it
[16:18:49] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:05] Like "select max(rev_id) from revision" on dewiki
[16:19:21] Restart heartbeat please
[16:19:24] ^
[16:19:28] on it
[16:19:46] done and started
[16:19:46] <_joe_> !log set read_only=0 on db1130
[16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:59] <_joe_> claime: did you restart heartbeat
[16:20:03] _joe_: just did
[16:20:19] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:31] _joe_: shouldn't i have?
[16:20:37] This is to bring back functionality. Now we need to check for corruptions :/
[16:20:48] RECOVERY - MariaDB read only s5 #page on db1130 is OK: Version 10.4.26-MariaDB-log, Uptime 308s, read_only: False, event_scheduler: True, 1374.95 QPS, connection latency: 0.004782s, query latency: 0.000239s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:21:10] Amir1: corruption is possible, since the trigger for the reboot seems to be an ECC uncorrectable error
[16:21:11] Amir1: max(rev_id) on dewiki is 235930044 ;
[16:21:31] (not sure what to do with that information, other than it tells us the database started up vaguely OK)
[16:21:33] We need to switchover ASAP
[16:22:06] Amir1: do you want us to attempt that before m.arostegui gets to a computer?
[16:22:10] yeah, I'm home in 15 minutes and I will do the switch
[16:22:10] <_joe_> replication restarted fine
[16:22:12] <_joe_> and no
[16:22:16] I'll be home in half an hour. Maybe Manuel
[16:22:23] <_joe_> marostegui said he'll take a further look when he's home
[16:22:27] don't worry I'll get to it
[16:22:29] Awesome
[16:22:31] <_joe_> right now the wikis work, replication is working fine
[16:22:33] can someone create a quick task?
[16:22:43] to track the crash
[16:22:45] Thanks marostegui 💙💙💙
[16:22:57] I'll make a phab task to track the crash
[16:23:14] should we update wikimediastatus?
[16:23:21] thanks Emperor
[16:23:46] claime: the impact is read only for dewiki and a bunch of wikis. Not too major tbh
[16:24:10] I disagree a bit with "not too major" :)
[16:24:16] spike in errors
[16:24:22] it wasn't too many minutes that's for sure
[16:24:28] I'd at least post that we had a db event and we're monitoring
[16:24:40] I'd like to have clear tiers for incidents so we know it for sure but that's for later
[16:24:41] <_joe_> I can edit on dewiki just fine
[16:25:12] Yeah. It was a couple of minutes
[16:25:14] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (MatthewVernon)
[16:26:06] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Marostegui) a:Marostegui The initial issue was triaged. I'll be home in 10 minutes and will replace the master
[16:26:07] I've not put many details on that, but I think it's enough to capture what needs looking at
[16:26:21] ok I'll defer to you guys, not updating status
[16:26:21] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Marostegui) p:Triage→High
[16:26:30] yeah that's good Emperor thank you!
[16:26:38] NP. You need anything else doing?
[16:26:40] Thanks
[16:26:56] nope, you can go back to the UK fantastic weather Emperor
[16:27:16] * Emperor got rained on earlier :)
[16:27:30] xD
[16:27:39] Added the ipmi-sel with date and uptime for time correlation
[16:27:43] SRE, DBA: db1130 crash - https://phabricator.wikimedia.org/T343076 (Clement_Goubert) ` cgoubert@db1130:~$ sudo ipmi-sel | grep Jul-29; date; uptime 123 | Jul-29-2023 | 16:08:55 | ECC Uncorr Err | Memory | Uncorrectable memory error Sat 29 Jul 2023 04:27:01 PM UTC 16:27:01 up 14 min...
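The recovery steps carried out in the 16:18–16:21 window above, gathered into one sketch. The commands follow the !log entries, the pt-heartbeat unit name from the systemd alert, and Clement_Goubert's ipmi-sel paste; the exact client invocations are illustrative rather than a transcript of what was typed:

```
# 1. Hardware cause: read the BMC event log (as in the Phabricator paste).
sudo ipmi-sel | grep Jul-29          # shows the ECC uncorrectable memory error

# 2. pt-heartbeat died with the crash, hence the degraded-systemd alert:
sudo systemctl restart pt-heartbeat-wikimedia.service

# 3. A restarted master comes back read-only; flip it once the host is trusted
#    again (what _joe_ !logged at 16:19:46), then run a sanity select:
sudo db-mysql db1130 <<'SQL'
SET GLOBAL read_only = 0;
SELECT MAX(rev_id) FROM dewiki.revision;   -- returned 235930044 at 16:21:11
SQL
```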
[16:29:25] (PS1) Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942786 (https://phabricator.wikimedia.org/T343077)
[16:30:17] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[16:30:59] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[16:36:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T343077
[16:36:07] T343077: Switchover s5 master (db1130 -> db1183) - https://phabricator.wikimedia.org/T343077
[16:36:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T343077
[16:36:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1183 with weight 0 T343077', diff saved to https://phabricator.wikimedia.org/P49798 and previous config saved to /var/cache/conftool/dbconfig/20230729-163621-root.json
[16:42:24] (PS1) Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942787 (https://phabricator.wikimedia.org/T343078)
[16:46:40] (PS1) Marostegui: wmnet: Update s5-master cname [dns] - https://gerrit.wikimedia.org/r/942771 (https://phabricator.wikimedia.org/T343077)
[16:47:46] (CR) Marostegui: [C: +2] mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/942786 (https://phabricator.wikimedia.org/T343077) (owner: Gerrit maintenance bot)
[16:50:41] (PS1) Marostegui: db1130: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/942772 (https://phabricator.wikimedia.org/T343077)
[16:57:34] !log Starting emergency s5 eqiad failover from db1130 to db1183 - T343077 T343076
[16:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:40] T343077: Switchover s5 master (db1130 -> db1183) - https://phabricator.wikimedia.org/T343077
[16:57:41] T343076: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076
[16:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Emergency switchover T343077', diff saved to https://phabricator.wikimedia.org/P49799 and previous config saved to /var/cache/conftool/dbconfig/20230729-165748-root.json
[16:58:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1183 to s5 primary T343077', diff saved to https://phabricator.wikimedia.org/P49800 and previous config saved to /var/cache/conftool/dbconfig/20230729-165813-root.json
[16:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 T343077', diff saved to https://phabricator.wikimedia.org/P49801 and previous config saved to /var/cache/conftool/dbconfig/20230729-165954-root.json
[17:00:14] (CR) Marostegui: [C: +2] wmnet: Update s5-master cname [dns] - https://gerrit.wikimedia.org/r/942771 (https://phabricator.wikimedia.org/T343077) (owner: Marostegui)
[17:00:23] (CR) Marostegui: [C: +2] db1130: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/942772 (https://phabricator.wikimedia.org/T343077) (owner: Marostegui)
[17:01:37] SRE, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:02:10] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:04:45] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui) db1130 is scheduled for refresh with the HW that is arriving this quarter at T341269
[17:06:43] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui) @Jclark-ctr any chances you've got an old DIMM somewhere to replace this one? ` /admin1/system1/logs1/log1-> show record123 properties CreationTimestamp = 20230707040843.000000-300 ElementN...
[17:07:24] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui)
[17:52:28] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:50] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:39:08] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:40:26] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[21:00:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:00:58] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:04] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:43:36] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
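For reference, the emergency s5 failover recorded in the 16:36–17:00 entries above follows the standard primary-switchover sequence: downtime the section, prepare the candidate, move the master pointer in dbctl, depool the crashed host, and land the puppet/DNS changes. A condensed sketch; the cookbook arguments and dbctl sub-command spellings are assumptions reconstructed from the SAL messages, not verbatim commands, and the host query is a placeholder:

```
# Run from cumin1001 (per the SAL entries); flag spellings are assumed.

# Downtime the 25 affected hosts for an hour (SAL 16:36:03):
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T343077" '<s5 hosts query>'

# Give the candidate master zero read weight and commit (SAL 16:36:22):
sudo dbctl instance db1183 set-weight 0
sudo dbctl config commit -m "Set db1183 with weight 0 T343077"

# Switch the section master and depool the crashed host (SAL 16:57:49-16:59:55):
sudo dbctl --scope eqiad section s5 set-master db1183
sudo dbctl config commit -m "Promote db1183 to s5 primary T343077"
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130 T343077"

# Finally, merge the prepared puppet and DNS changes:
#   https://gerrit.wikimedia.org/r/942786  (mariadb: Promote db1183 to s5 master)
#   https://gerrit.wikimedia.org/r/942771  (wmnet: Update s5-master cname)
```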