[00:22:47] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:22:58] <wikibugs>	 (03CR) 10Ivi104: [C: 03+1] "Many thanks Aca!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: 10Acamicamacaraca)
[06:00:31] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 203 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:04:21] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 33 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:08:11] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 527 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:10:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:28:19] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210808T0700)
[07:28:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:29:07] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:34:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:44:49] <wikibugs>	 (03Abandoned) 10Labdajiwa: Set the project namespace and sitename for Javanese Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437) (owner: 10Labdajiwa)
[08:17:10] <wikibugs>	 (03PS3) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919)
[08:18:42] <wikibugs>	 (03CR) 10Elukey: "Tested the image and it seems working fine. On the size side:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey)
[08:23:02] <wikibugs>	 (03PS4) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919)
[08:23:19] <wikibugs>	 (03CR) 10Elukey: "Removed one layer that was not needed :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey)
[08:25:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:28:07] <wikibugs>	 (03PS5) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919)
[08:29:31] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:17:15] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[10:19:09] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[10:30:31] <wikibugs>	 (03PS1) 10Jelto: fix shell for backup cronjob [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/710676 (https://phabricator.wikimedia.org/T288324)
[10:51:47] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[10:53:41] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:34:22] <wikibugs>	 (03PS3) 10Fomafix: Add 'nrf' as alias for 'nrm' [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216)
[12:35:49] <wikibugs>	 (03PS4) 10Fomafix: Add 'nrf' as alias for 'nrm' [puppet] - 10https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216)
[12:37:17] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:39:13] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[14:28:37] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[14:30:33] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:12:15] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:13:39] <wikibugs>	 (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: 10Acamicamacaraca)
[16:14:09] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:25:31] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:17] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:58:11] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:01:57] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:49] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:19] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:11] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:01] <icinga-wm>	 PROBLEM - Disk space on wdqs2003 is CRITICAL: DISK CRITICAL - free space: /srv 113574 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2003&var-datasource=codfw+prometheus/ops
[19:01:19] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:39] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:44:55] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Peachey88) a:05CapitainAfrika→03None
[20:02:07] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:13] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:27:07] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:13] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:57] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:19:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: 10Acamicamacaraca)
[21:25:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:11] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:12:09] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:14:05] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:27:03] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:33] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:23] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:35] <icinga-wm>	 RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-08-08 21:56:59 (583 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting