[00:21:03] RECOVERY - Disk space on maps2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2009&var-datasource=codfw+prometheus/ops
[00:22:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:34] (CR) DannyS712: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: DannyS712)
[01:38:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[02:00:41] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[02:49:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:49:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:05:11] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:45:21] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:19:34] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[04:53:29] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:55:25] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:57:47] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 162.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[05:19:52] SRE, DBA, Traffic, Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (Peachey88)
[05:51:39] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 58.98 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[06:18:13] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:27:39] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 58.98 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210905T0700)
[07:08:27] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 765.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[07:11:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:00:15] SRE, Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (Legoktm) Open→Resolved a: Legoktm And...hasn't happened since. Closing as resolved as just a fluke.
[08:09:27] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 51.86 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[08:23:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:25:25] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:27:21] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 27 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:34:23] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[09:03:53] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:05:09] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[09:13:01] (CR) Aklapper: [C: +1] Phabricator: add override for the browser time zone conflict message [puppet] - https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: DannyS712)
[09:26:19] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[10:21:55] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:02:45] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:02:55] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:06:45] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:46:12] SRE, ops-codfw, DC-Ops: Netbox Errors in codfw - https://phabricator.wikimedia.org/T290362 (Reedy)
[12:02:49] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 272.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[12:56:35] SRE, DBA, Traffic, User-Ladsgroup, Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (Ladsgroup) a: Ladsgroup
[12:57:57] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 53.9 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[13:04:15] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:12:13] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:19:51] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:34] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 129.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[14:13:01] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:29:03] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 153.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[15:28:37] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[16:16:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:18:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:42:29] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (Xuanvy95) Resolved→Open
[17:44:07] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (DannyS712) @Xuanvy95 why did you reopen this?
[18:12:52] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (Zabe) a: Zabe→None
[18:13:19] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (Zabe) Open→Resolved a: Zabe No response, account disabled, closing.
[18:20:41] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 134.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[18:54:54] !log wikiadmin@10.192.0.119(ptwiki)> update protected_titles set pt_create_perm='editautoreviewprotected' where pt_create_perm='autoreviewer'; # T290396
[18:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:00] T290396: Titles that cannot be unprotected on ptwiki - https://phabricator.wikimedia.org/T290396
[19:12:19] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 40.68 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[19:19:47] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1175.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:52:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:07] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:34] SRE, Continuous-Integration-Infrastructure, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), Patch-For-Review, Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) >>! In T209149#7332532, @...
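The [18:54:54] !log entry above records a one-off rename of a protection right in ptwiki's protected_titles table for T290396. As a minimal sketch of the same migration with a scoping step, assuming only the table and column names visible in the logged statement (the SELECT is a hypothetical pre-check, not part of the log):

    -- Hypothetical pre-check: count rows still carrying the old right name (assumption, not logged).
    SELECT COUNT(*) FROM protected_titles WHERE pt_create_perm = 'autoreviewer';
    -- The migration as logged at 18:54:54, renaming the right on every affected row (T290396):
    UPDATE protected_titles
    SET pt_create_perm = 'editautoreviewprotected'
    WHERE pt_create_perm = 'autoreviewer';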
[20:34:41] (CR) BryanDavis: [C: +2] readme: Update readme for many changes that went undocumented [deployment-charts] - https://gerrit.wikimedia.org/r/716553 (owner: BryanDavis)
[20:37:45] (Merged) jenkins-bot: readme: Update readme for many changes that went undocumented [deployment-charts] - https://gerrit.wikimedia.org/r/716553 (owner: BryanDavis)
[20:43:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:53] PROBLEM - SSH on ms-be1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:51:37] RECOVERY - SSH on ms-be1040 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:00:27] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The following units failed: session-189325.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:01] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The following units failed: session-71575.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:41] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:41] RECOVERY - Check systemd state on ms-be1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:26:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:37] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:45:59] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 214.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[21:46:31] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 217.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[22:01:57] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:47] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:11] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 62.03 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[22:37:47] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 39.66 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[22:50:09] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[22:58:07] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The following units failed: session-189375.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:33] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:35] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:07] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:01] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:33] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[23:34:49] SRE, Wikimedia-Incident: 2021-09-04 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (AntiCompositeNumber)