[00:11:07] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:24] <wikibugs>	 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10473310 (10Ahoelzl) 05Open→03Resolved
[00:38:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112307
[00:38:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112307 (owner: 10TrainBranchBot)
[00:44:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:56:50] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112307 (owner: 10TrainBranchBot)
[01:01:29] <logmsgbot>	 !log mstyles@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[01:02:21] <logmsgbot>	 !log mstyles@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[01:04:31] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:04:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:05:08] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:06:14] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:06:55] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:07:04] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[01:07:06] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[01:07:16] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[01:07:19] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[01:08:57] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112308
[01:08:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112308 (owner: 10TrainBranchBot)
[01:09:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473484 (10phaultfinder)
[01:29:30] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112308 (owner: 10TrainBranchBot)
[02:12:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:18:23] <icinga-wm>	 PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:24:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473519 (10phaultfinder)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[03:04:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473539 (10phaultfinder)
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:05:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473579 (10phaultfinder)
[04:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:30:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:32:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:34:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473596 (10phaultfinder)
[04:35:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:40:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:44:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:47:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:49:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[04:49:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:07:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:09:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:42:06] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384110 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[05:42:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384110 (10ops-monitoring-bot) 03NEW
[06:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473637 (10phaultfinder)
[06:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[07:40:13] <Ammar>	 !log T384109 Ran mwscript-k8s --comment="T384109" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Ruhaina Sheikh' 'Lee Ailiseu'
[07:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:17] <stashbot>	 T384109: Unblock stuck global renames - https://phabricator.wikimedia.org/T384109
[07:40:32] <Ammar>	 !log T384109 Ran mwscript-k8s --comment="T384109" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=hywiktionary --logwiki=metawiki 'Մարիամ11' 'GrigorYAN'
[07:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:34] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:38:26] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.684 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:49:00] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:50:54] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29350 bytes in 1.644 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:51:59] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384111 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[08:52:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384111 (10ops-monitoring-bot) 03NEW
[08:54:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:58:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473672 (10phaultfinder)
[09:06:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384111#10473675 (10Peachey88) →14Duplicate dup:03T382984
[09:06:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10473677 (10Peachey88)
[09:06:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384110#10473679 (10Peachey88) →14Duplicate dup:03T382984
[09:06:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10473681 (10Peachey88)
[09:14:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[10:39:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:44:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:44:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473734 (10phaultfinder)
[10:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:06] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update Striker to 2025-01-18-122323-production [puppet] - 10https://gerrit.wikimedia.org/r/1112327
[12:48:41] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Update Striker to 2025-01-18-122323-production [puppet] - 10https://gerrit.wikimedia.org/r/1112327 (owner: 10Majavah)
[12:50:24] <wikibugs>	 (03PS1) 10Majavah: Revert "prometheus: scrape otelcol metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1112328
[12:52:36] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Revert "prometheus: scrape otelcol metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1112328 (owner: 10Majavah)
[12:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:59:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[12:59:44] <jinxer-wm>	 FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:19:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:22:42] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:26:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473805 (10phaultfinder)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1170 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:48:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[14:52:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:54:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:12:54] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1170 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473827 (10phaultfinder)
[15:44:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473844 (10phaultfinder)
[16:11:02] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[16:11:54] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29350 bytes in 1.804 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[16:14:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10473845 (10phaultfinder)
[16:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:14] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:56] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:58:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1144 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:59:44] <jinxer-wm>	 FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:01:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:03:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1144 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:03:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:04:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:11:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:18:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubestage2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:19:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:24:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:28:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:04:55] <wikibugs>	 (03PS1) 10Majavah: wikitech: Drop obsolete oauthadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112333 (https://phabricator.wikimedia.org/T384122)
[18:08:56] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 60%, RTA = 0.39 ms
[18:15:20] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[18:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:23:19] <wikibugs>	 (03PS1) 10Majavah: wikitech: Drop oathauth group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112335 (https://phabricator.wikimedia.org/T384123)
[20:25:30] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:30:11] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128 (10Dylsss) 03NEW
[20:40:15] <wikibugs>	 06SRE, 10SRE-swift-storage, 07Upstream: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10474026 (10Dylsss)
[20:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:59:44] <jinxer-wm>	 FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:31] <jinxer-wm>	 FIRING: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:14:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10474044 (10phaultfinder)
[21:17:48] <wikibugs>	 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10474045 (10A_smart_kitten)
[21:29:31] <jinxer-wm>	 RESOLVED: Wikidata Reliability Metrics - Median loading time alert: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[22:51:31] <jinxer-wm>	 FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[22:53:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:17:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:21:06] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:22:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:24:22] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service restbase2029-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:27:18] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service restbase2029-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:42:18] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2029-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:44:22] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2029-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown