[00:05:28] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109796
[00:38:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109796 (owner: TrainBranchBot)
[00:39:49] (PS2) NMW03: Add azwiki to mobile-anon-talk dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394)
[00:41:48] ops-eqiad, SRE, Data-Persistence, DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10450419 (VRiley-WMF) a:VRiley-WMF
[00:59:58] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109796 (owner: TrainBranchBot)
[01:08:44] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109804
[01:08:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109804 (owner: TrainBranchBot)
[01:12:24] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:12:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:13:19] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:29:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109804 (owner: TrainBranchBot)
[01:53:37] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T383076#10450458 (phaultfinder)
[01:54:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:55:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:20:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:29:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:40:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:42:37] (PS1) Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableEventWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078)
[02:43:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078) (owner: Daimona Eaytoy)
[03:07:09] (PS1) Daimona Eaytoy: Enable CampaignEvents extension on idwiki, itwiki, mswiki, and plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1109842 (https://phabricator.wikimedia.org/T383154)
[03:08:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1109842 (https://phabricator.wikimedia.org/T383154) (owner: Daimona Eaytoy)
[03:09:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10450562 (phaultfinder)
[03:20:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:52:04] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383474 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[03:52:11] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383474 (ops-monitoring-bot) NEW
[04:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:51:54] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383475 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[04:52:06] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383475 (ops-monitoring-bot) NEW
[05:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:41:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:51:28] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:21:54] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383476 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[06:22:00] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383476 (ops-monitoring-bot) NEW
[08:00:46] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383474#10450652 (Peachey88) →Duplicate dup:T383475
[08:00:48] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383475#10450654 (Peachey88)
[08:01:25] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383476#10450656 (Peachey88) →Duplicate dup:T383475
[08:01:28] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383475#10450658 (Peachey88)
[08:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:43] (PS1) Brouberol: airflow-wmde: remove extra network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1109926 (https://phabricator.wikimedia.org/T380613)
[09:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:35:13] SRE-swift-storage, UploadWizard, Unstewarded-production-error, Wikimedia-production-error: "Could not store upload in the stash (UploadStashFileException)" for 2.4 GiB TIF file - https://phabricator.wikimedia.org/T285341#10450724 (Yann) Same error message for a 2.42 GB video: `03452: finalize...
[10:58:39] (PS1) Majavah: hieradata: Add cloud-private v6 supernets [puppet] - https://gerrit.wikimedia.org/r/1109983 (https://phabricator.wikimedia.org/T379283)
[10:59:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10450734 (phaultfinder)
[11:00:54] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4781/console" [puppet] - https://gerrit.wikimedia.org/r/1109983 (https://phabricator.wikimedia.org/T379283) (owner: Majavah)
[11:13:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10450735 (phaultfinder)
[11:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10450738 (phaultfinder)
[12:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:31:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:52:28] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:16] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:21] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:05:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:28] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:35:23] (PS1) Ladsgroup: Add new file tables to WMCS views [puppet] - https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491)
[19:11:04] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.252 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:16:02] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:19:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.295 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:20:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 3.087 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:23:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.571 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451052 (phaultfinder)
[19:27:20] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:29:02] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:32:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.610 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:34:06] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 3.087 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:37:20] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:38:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.729 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:40:10] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 7.167 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:44:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.351 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:45:02] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:48:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.689 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:50:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:57:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.465 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:59:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:06:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.651 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:09:06] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:10:06] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 3.075 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:10:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:19:16] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:23:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.750 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:26:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:28:20] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:29:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.032 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:31:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:34:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.585 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:38:06] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.281 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[20:38:20] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:53:42] FIRING: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:58:57] RESOLVED: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:09:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451099 (phaultfinder)
[21:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451115 (phaultfinder)
[21:38:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.174 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[21:43:06] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:08:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.671 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:12:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:16:40] (PS1) Reedy: CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in [mediawiki-config] - https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501)
[22:23:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.682 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:23:57] PROBLEM - MariaDB Replica SQL: s1 #page on db2212 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: enwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:27:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:27:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:31:57] PROBLEM - MariaDB Replica Lag: s1 #page on db2212 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 652.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:33:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.649 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:34:07] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:43:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:52:29] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:53:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.103 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[22:54:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[23:02:13] !log fabfur@cumin1002 dbctl commit (dc=all): 'Depool db2212', diff saved to https://phabricator.wikimedia.org/P71985 and previous config saved to /var/cache/conftool/dbconfig/20250111-230213-fabfur.json
[23:02:39] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2212.codfw.wmnet with reason: Replication lag
[23:02:53] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2212.codfw.wmnet with reason: Replication lag
[23:57:44] FIRING: Wikidata Reliability Metrics - Median Payload alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert