[00:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106745 [00:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106745 (owner: 10TrainBranchBot) [00:56:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106745 (owner: 10TrainBranchBot) [01:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106746 [01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106746 (owner: 10TrainBranchBot) [01:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423308 (10phaultfinder) [01:25:49] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106746 (owner: 10TrainBranchBot) [01:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:42] RESOLVED: JobUnavailable: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:13:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:13:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:14:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:34:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423311 (10phaultfinder) [04:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241226T0800) [08:14:48] (03PS1) 10Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) [08:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423343 (10phaultfinder) [08:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423344 (10phaultfinder) [09:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423346 (10phaultfinder) [09:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:00] PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table enwiki.pagelinks: Index for table pagelinks is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1196-bin.005085, end_log_pos 449719054 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:48] PROBLEM - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:20] PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 633.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:20] PROBLEM - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 633.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:46] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:05:34] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:05:47] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:06:07] PROBLEM - Host mw1420 is DOWN: PING CRITICAL - Packet loss = 100% [10:06:09] PROBLEM - Host db1187 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:06:15] RECOVERY - Host mw1420 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [10:06:20] PROBLEM - Host db1179 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:06:39] PROBLEM - Host cloudsw1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:06:39] PROBLEM - Host cloudsw1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:06:39] PROBLEM - Host cloudsw1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:06:46] RECOVERY - Host db1179 #page is UP: PING WARNING - Packet loss = 50%, RTA = 0.35 ms [10:06:50] RECOVERY - Host db1187 #page is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:07:06] rack down? [10:07:17] RECOVERY - Host cloudsw1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [10:07:17] RECOVERY - Host cloudsw1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.14 ms [10:07:17] RECOVERY - Host cloudsw1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [10:07:24] seems to be network [10:07:29] Network flap? 
[10:07:35] seems to [10:07:37] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:07:39] the hosts are up [10:07:47] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:09:19] PROBLEM - Host dbprov1005 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:47] RECOVERY - Host dbprov1005 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [10:14:17] !incidents [10:14:17] 5574 (UNACKED) Host db1187 (paged) - PING - Packet loss = 100% [10:15:33] db1187 is in eqiad row A6 [10:15:47] !ack 5574 [10:15:48] 5574 (ACKED) Host db1187 (paged) - PING - Packet loss = 100% [10:16:40] FIRING: KubernetesRsyslogDown: rsyslog on mw1458:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1458 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:17:25] the other host which pag.ed db1179 is in b5 [10:17:57] jelto: yes, see security [10:21:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10423360 (10Marostegui) Let's get a replacement disk [10:22:25] !incidents [10:22:25] 5574 (ACKED) Host db1187 (paged) - PING - Packet loss = 100% [10:26:40] RESOLVED: KubernetesRsyslogDown: rsyslog on mw1458:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1458 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:27:52] Follow up task for the alert: https://phabricator.wikimedia.org/T382772 [10:39:41] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2947.34 seconds Marostegui https://phabricator.wikimedia.org/T382771 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:41] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2978.39 seconds Marostegui https://phabricator.wikimedia.org/T382771 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:41] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2945.26 seconds Marostegui https://phabricator.wikimedia.org/T382771 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:41] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2978.38 seconds Marostegui https://phabricator.wikimedia.org/T382771 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:41] ACKNOWLEDGEMENT - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table enwiki.pagelinks: Index for table pagelinks is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1196-bin.005085, end_log_pos 449719054 Marostegui https://phabricator.wikimedia.org/T382771 https://wikitech.wikime [10:39:42] wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:15:24] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [11:16:16] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Docker [11:44:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423380 (10phaultfinder) [12:14:16] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1106917 (owner: 10L10n-bot) [12:18:29] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1106920 (owner: 10L10n-bot) [12:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423400 (10phaultfinder) [13:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423413 (10phaultfinder) [13:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T382535#10423438 (10phaultfinder) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:08] RECOVERY - MariaDB Replica SQL: s1 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:51:56] PROBLEM - MariaDB Replica SQL: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table enwiki.pagelinks: Index for table pagelinks is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1154-bin.003696, end_log_pos 306925282 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423475 (10phaultfinder) [15:42:54] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:43:26] RECOVERY - MariaDB Replica Lag: s1 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:43:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[16:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423507 (10phaultfinder) [16:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423514 (10phaultfinder) [16:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423518 (10phaultfinder) [17:18:56] RECOVERY - MariaDB Replica SQL: s1 on an-redacteddb1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423523 (10phaultfinder) [17:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423527 (10phaultfinder) 
[18:09:56] RECOVERY - MariaDB Replica Lag: s1 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:20:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423544 (10phaultfinder) [18:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423667 (10phaultfinder) [19:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423760 (10phaultfinder) [19:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423863 (10phaultfinder) [20:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423868 (10phaultfinder) [20:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423887 (10phaultfinder) [21:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423894 (10phaultfinder) [21:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423897 (10phaultfinder) [23:14:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10423922 (10phaultfinder) [23:42:12] 
PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:42:24] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:43:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:43:14] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring