[00:38:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954368
[00:38:56] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954368 (owner: TrainBranchBot)
[00:54:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954368 (owner: TrainBranchBot)
[01:19:38] (CR) Deni: [C: +1] "Approved on-wiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[01:25:40] (CR) Acamicamacaraca: "I can reschedule this for deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/949171 (https://phabricator.wikimedia.org/T344306) (owner: Acamicamacaraca)
[02:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:59] PROBLEM - snapshot of s2 in eqiad on backupmon1001 is CRITICAL: snapshot for s2 at eqiad
(db1139) taken more than 3 days ago: Most recent backup 2023-08-31 03:10:45 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:34:07] (PS1) Terasail: Add 'confirmed' to Wikifunctions sysop add and remove [mediawiki-config] - https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261)
[04:32:11] (CR) Terasail: "Add Jdforrester as reviewer (WF Staff and reviewer of other WF changes)." [mediawiki-config] - https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261) (owner: Terasail)
[05:10:35] PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: snapshot for x1 at eqiad (db1216) taken more than 3 days ago: Most recent backup 2023-08-31 05:05:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:05:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:06:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:10:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:10:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:33:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230903T0700)
[09:26:19] (PS1) Marostegui: db1128: Host crashed [puppet] - https://gerrit.wikimedia.org/r/954392 (https://phabricator.wikimedia.org/T345509)
[09:28:30] (CR) Marostegui: [C: +2] db1128: Host crashed [puppet] - https://gerrit.wikimedia.org/r/954392 (https://phabricator.wikimedia.org/T345509) (owner: Marostegui)
[09:29:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:34:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:35:02] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:44:48] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 130 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:23:00] (ProbeDown) firing: Service etherpad1003:9001 has failed probes (http_etherpad_nodejs_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:9001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:28:00] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:28:59] (JobUnavailable) firing: (2) Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:38:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:43:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:56:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:05:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.521 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:05:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:42:07] (PS3) Acamicamacaraca: Enable AbuseFilter blocks on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/954240 (https://phabricator.wikimedia.org/T345513)
[15:51:38] db1128 paged again but is already depooled, expired ack from yesterday
[15:51:46] on phone but no action required anyway
[15:53:15] I'm going to resolve it so it doesn't page again
[15:53:25] (it
was just expired page)
[15:53:46] yes, ok, fair
[15:56:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:01:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:13:05] PROBLEM - Host db1137 #page is DOWN: PING CRITICAL - Packet loss = 100%
[18:14:29] today is not kind to us
[18:14:32] let me check
[18:15:38] host has broken memory per SEL
[18:16:00] Corretable memory error rate exceeded for DIMM_B6
[18:16:05] Correctable memory error rate exceeded for DIMM_B6
[18:16:38] depooled
[18:16:48] that's only one replica left for x1
[18:17:02] tomorrow-me problem
[18:17:27] I'll open a DC ops task, the server is long OOW, but maybe we have a spare module
[18:18:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:19:47] ops-eqiad, Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (MoritzMuehlenhoff)
[18:20:02] and acked in VO
[18:43:19] RECOVERY - Host db1137 #page is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[18:45:31] PROBLEM - MariaDB Replica IO: x1 #page on db1137 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:46:15] PROBLEM - mysqld processes #page on
db1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[18:46:33] PROBLEM - MariaDB Replica SQL: x1 #page on db1137 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:46:44] PROBLEM - MariaDB read only x1 on db1137 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[18:46:49] Sigh
[18:46:57] * Emperor appears.
[18:47:28] I'm catching up with scroll, but can these all just be ACKd given db1137 is depooled and the lack of x1 master is a "tomorrow" problem?
[18:47:47] yeah, let's downtime it
[18:47:51] I've just acked them in VO
[18:48:01] (the three new ones, which are all related to the original one)
[18:48:06] moritzm: let's resolve it, otherwise it pages in 24 hours
[18:48:16] sure, can do
[18:48:28] I try to downtime it
[18:48:46] done
[18:49:18] thanks
[18:54:37] downtimed
[18:54:59] cool, thanks, I shall go back to my Sunday evening :)
[19:36:39] SRE, ops-eqiad, DBA, Data-Persistence: Broken DIMM on db1137 - https://phabricator.wikimedia.org/T345514 (Marostegui)
[22:20:02] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable