[00:02:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:05:18] SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returning 500's - https://phabricator.wikimedia.org/T323448 (Ladsgroup) I can open it now, I think there is probably some crawlers putting too much pressure on the server. The logs say this :/
[00:07:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:21:47] SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returning 500's - https://phabricator.wikimedia.org/T323448 (RoySmith) Hmmm, yup, working for me now too.
[00:43:54] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:28] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:13] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:06:58] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:08:46] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:20:58] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[05:22:58] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:05:13] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:39:22] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[07:41:12] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221120T0800)
[09:34:24] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:34:46] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[09:40:30] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:05:14] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:12:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:35:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:37:44] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[11:02:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:16:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:17:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:18:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:18:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:50:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:34:52] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:40:54] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:14:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:29:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:14:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (UPDATE clusterissuers) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:19:47] (PS1) Stang: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378)
[15:47:38] (PS2) Stang: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378)
[15:48:57] (CR) Stang: zhwiki: Install PageTriage on Beta Cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: Stang)
[15:50:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:12:49] (CR) Ssingh: [C: +2] cp5028: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/858673 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[18:14:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS buster
[18:14:25] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS buster
[18:23:48] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:24:24] PROBLEM - Check systemd state on ms-be2060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:39:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:43:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage
[18:44:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:47:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage
[18:53:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:58:37] RECOVERY - Check systemd state on ms-be2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:51] RECOVERY - cassandra-a CQL 10.64.131.14:9042 on aqs1020 is OK: TCP OK - 0.000 second response time on 10.64.131.14 port 9042 https://phabricator.wikimedia.org/T93886
[19:11:14] (PS1) Stang: zhwiki: Revert 20 years logos [mediawiki-config] - https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859)
[19:12:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[19:14:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:16:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5028.eqsin.wmnet with OS buster
[19:16:09] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5028.eqsin.wmnet with OS buster completed: - cp5028 (**PASS**) -...
[19:22:57] (PS2) Stang: zhwiki: Revert 20 years logos [mediawiki-config] - https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859)
[19:23:33] (PS3) Stang: zhwiki: Revert 20 years logos [mediawiki-config] - https://gerrit.wikimedia.org/r/858709 (https://phabricator.wikimedia.org/T320859)
[19:42:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:44:27] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:50:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:00:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:02:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:17:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:21:36] (CR) Gergő Tisza: "Causes" [puppet] - https://gerrit.wikimedia.org/r/858556 (owner: Jbond)
[20:22:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:27:37] Puppet, Infrastructure-Foundations, Beta-Cluster-reproducible: Beta mwmaint puppet runs fail with "Resource type not found: Profile::Lvs::Classes" - https://phabricator.wikimedia.org/T322901 (Tgr) Open→Resolved a:Joe Was fixed in {221d0f75a0f769c2d56c1ef53a0a86a49d9d68e9}. (Now we get `C...
[20:29:00] !log initiating Cassandra bootstrap, aqs1020-b -- T307802
[20:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:39] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802
[20:31:09] RECOVERY - cassandra-b service on aqs1020 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:31:53] RECOVERY - cassandra-b SSL 10.64.131.15:7001 on aqs1020 is OK: SSL OK - Certificate aqs1020-b valid until 2024-11-08 15:06:37 +0000 (expires in 718 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:38:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:13] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:40:06] (PS1) Stang: Fix typo in tests/LoggingTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/858715
[20:52:30] (PS1) AOkoth: vrts: add error checking [puppet] - https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059)
[20:53:39] (PS2) AOkoth: vrts: add error checking [puppet] - https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059)
[21:36:51] PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:39:49] (PS1) Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/858717
[22:45:13] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:21:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[23:23:51] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[23:36:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency