[00:01:23] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:38] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10Tgr)
[00:03:40] 10SRE, 10Acme-chief, 10Traffic-Icebox, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10Tgr)
[00:05:17] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:59:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:08:19] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 462.56 ms
[01:38:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:55] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:52] (03CR) 10RLazarus: slo_dashboards: move to one SLO/SLI per dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron)
[01:48:45] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:01:21] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 38 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:19:13] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 17 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[03:32:38] (03PS1) 10Varnent: Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547
[03:33:18] (03CR) 10CI reject: [V: 04-1] Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (owner: 10Varnent)
[03:33:23] (03PS2) 10Varnent: Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967)
[03:34:03] (03CR) 10CI reject: [V: 04-1] Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent)
[03:47:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:59:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:05:47] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:09] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
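(Context for the check_systemd_state alerts above: the check mirrors systemd's overall state, so "degraded" on mirror1001 means at least one unit has failed. A minimal sketch of inspecting and clearing that state by hand, assuming shell access to the host; these are standard systemd/journalctl commands, not taken from the log:)

    systemctl is-system-running                                # prints "degraded" while any unit is failed
    systemctl list-units --state=failed                        # here: update-ubuntu-mirror.service
    journalctl -u update-ubuntu-mirror.service -n 50           # recent output from the failed unit
    sudo systemctl reset-failed update-ubuntu-mirror.service   # clear the failed state once fixed, so the check can recover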
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221029T0700)
[07:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[07:36:29] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:39:07] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:59] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:57:21] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:05:39] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:28:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:33:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:37:21] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:59:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:36:16] (03PS1) 10Majavah: thanos::rule: allow using without object store [puppet] - 10https://gerrit.wikimedia.org/r/850628
[09:36:18] (03PS1) 10Majavah: P:metricsinfra: add thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301)
[09:42:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:44:14] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:53:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:53:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:57:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:57:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:19:10] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:21:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:21:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:21:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[11:22:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:25:33] !log deploy patch for T321971
[11:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
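(The mwdebug-deploy SAL entries above record helmfile applies of the mw-debug service, once per data centre, from deploy1002. A minimal sketch of the command shape they correspond to, assuming it is run from within helmfile.d/services/mw-debug on the deployment host; the exact production wrapper and flags are not shown in the log:)

    helmfile -e eqiad apply    # the [eqiad] START ... DONE pair
    helmfile -e codfw apply    # the [codfw] START ... DONE pair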
[11:49:50] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[11:58:28] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:58:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:12:38] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:28:10] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:40:28] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:46:28] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[12:59:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:59:18] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:01:51] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @Dzahn Great, thank you very much!
[13:02:14] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) 05In progress→03Resolved
[13:02:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Kelson)
[13:06:52] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[13:10:36] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:13:30] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:14:46] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:03:22] (03PS1) 10Majavah: P:wmcs::instance: adjust syslog handling [puppet] - 10https://gerrit.wikimedia.org/r/850633
[14:08:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:24] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:15:38] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:24:50] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:33:24] PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:01:40] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[16:33:52] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 187 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:35:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:44:10] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:57:21] (03PS1) 10Ahmon Dancy: gitlab_runner.yaml: Allow rsync to doc.discovery.wmnet from trusted runner containers [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629)
[16:57:55] (03CR) 10CI reject: [V: 04-1] gitlab_runner.yaml: Allow rsync to doc.discovery.wmnet from trusted runner containers [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy)
[16:58:45] (03PS2) 10Ahmon Dancy: Allow rsync to doc.discovery.wmnet from trusted runner containers [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629)
[17:50:58] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:03:56] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:09:14] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:18:50] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[18:48:30] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:54:30] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms
[18:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:05:13] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[19:39:56] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:41:00] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:47:10] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 2.85 ms
[19:52:48] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:55:30] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:01:30] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[20:40:46] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:53:54] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:57:20] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 174 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:59:18] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:18:22] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[22:00:22] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:08:02] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:09:14] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:07:42] (03PS4) 10TheDJ: Translate TimedText namespaces for iswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824444 (https://phabricator.wikimedia.org/T315715) (owner: 10MarcoAurelio)
[23:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
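(Several alerts above, notably JobUnavailable for calico-felix and KubernetesCalicoDown/KubernetesRsyslogDown on aux-k8s-ctrl1001, keep re-firing throughout the day; the alerts.wikimedia.org links are the dashboard view of the same data. A minimal sketch of listing the firing instances from a shell with amtool, assuming access to the Alertmanager API; the URL below is a placeholder, not taken from the log:)

    amtool --alertmanager.url=https://alertmanager.example.org alert query alertname=JobUnavailable
    amtool --alertmanager.url=https://alertmanager.example.org alert query 'alertname=~Kubernetes.*'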