[00:00:17] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:00:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:33] <jinxer-wm>	 (JobUnavailable) firing: (40) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:02:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T333332)', diff saved to https://phabricator.wikimedia.org/P46876 and previous config saved to /var/cache/conftool/dbconfig/20230415-000226-ladsgroup.json
[00:02:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:02:32] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:02:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46877 and previous config saved to /var/cache/conftool/dbconfig/20230415-000239-ladsgroup.json
[00:02:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:02:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T333332)', diff saved to https://phabricator.wikimedia.org/P46878 and previous config saved to /var/cache/conftool/dbconfig/20230415-000249-ladsgroup.json
[00:05:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T333332)', diff saved to https://phabricator.wikimedia.org/P46879 and previous config saved to /var/cache/conftool/dbconfig/20230415-000500-ladsgroup.json
[00:05:32] <jinxer-wm>	 (JobUnavailable) firing: (38) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:10:32] <jinxer-wm>	 (JobUnavailable) firing: (35) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:16:19] <icinga-wm_>	 RECOVERY - Disk space on urldownloader1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops
[00:17:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P46880 and previous config saved to /var/cache/conftool/dbconfig/20230415-001745-ladsgroup.json
[00:19:02] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:20:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P46881 and previous config saved to /var/cache/conftool/dbconfig/20230415-002006-ladsgroup.json
[00:20:33] <jinxer-wm>	 (JobUnavailable) firing: (23) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:21:57] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:02] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:25:32] <jinxer-wm>	 (JobUnavailable) firing: (22) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:31:23] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[00:32:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46882 and previous config saved to /var/cache/conftool/dbconfig/20230415-003251-ladsgroup.json
[00:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[00:32:57] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:33:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[00:33:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46883 and previous config saved to /var/cache/conftool/dbconfig/20230415-003315-ladsgroup.json
[00:33:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[00:35:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P46884 and previous config saved to /var/cache/conftool/dbconfig/20230415-003512-ladsgroup.json
[00:39:20] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908869
[00:39:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908869 (owner: 10TrainBranchBot)
[00:42:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46885 and previous config saved to /var/cache/conftool/dbconfig/20230415-004233-ladsgroup.json
[00:42:40] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:47:07] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[00:48:41] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[00:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T333332)', diff saved to https://phabricator.wikimedia.org/P46886 and previous config saved to /var/cache/conftool/dbconfig/20230415-005019-ladsgroup.json
[00:50:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[00:50:25] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[00:50:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[00:50:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T333332)', diff saved to https://phabricator.wikimedia.org/P46887 and previous config saved to /var/cache/conftool/dbconfig/20230415-005042-ladsgroup.json
[00:52:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T333332)', diff saved to https://phabricator.wikimedia.org/P46888 and previous config saved to /var/cache/conftool/dbconfig/20230415-005252-ladsgroup.json
[00:56:33] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908869 (owner: 10TrainBranchBot)
[00:57:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46889 and previous config saved to /var/cache/conftool/dbconfig/20230415-005740-ladsgroup.json
[01:06:21] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P46890 and previous config saved to /var/cache/conftool/dbconfig/20230415-010759-ladsgroup.json
[01:12:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P46891 and previous config saved to /var/cache/conftool/dbconfig/20230415-011246-ladsgroup.json
[01:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:53] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P46892 and previous config saved to /var/cache/conftool/dbconfig/20230415-012305-ladsgroup.json
[01:27:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46893 and previous config saved to /var/cache/conftool/dbconfig/20230415-012753-ladsgroup.json
[01:27:58] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:31:00] <wikibugs>	 (03PS1) 10Ssingh: cache::haproxy: enable systemd-coredump [puppet] - 10https://gerrit.wikimedia.org/r/908934 (https://phabricator.wikimedia.org/T334448)
[01:32:17] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40682/console" [puppet] - 10https://gerrit.wikimedia.org/r/908934 (https://phabricator.wikimedia.org/T334448) (owner: 10Ssingh)
[01:38:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T333332)', diff saved to https://phabricator.wikimedia.org/P46894 and previous config saved to /var/cache/conftool/dbconfig/20230415-013811-ladsgroup.json
[01:38:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[01:38:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[01:38:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[01:38:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T333332)', diff saved to https://phabricator.wikimedia.org/P46895 and previous config saved to /var/cache/conftool/dbconfig/20230415-013835-ladsgroup.json
[01:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T333332)', diff saved to https://phabricator.wikimedia.org/P46896 and previous config saved to /var/cache/conftool/dbconfig/20230415-014045-ladsgroup.json
[01:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P46897 and previous config saved to /var/cache/conftool/dbconfig/20230415-015551-ladsgroup.json
[02:02:54] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10AntiCompositeNumber) action=purge on the file description page is *supposed* to work, it just sometimes doesn't for reaso...
[02:06:41] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P46898 and previous config saved to /var/cache/conftool/dbconfig/20230415-021057-ladsgroup.json
[02:12:53] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10doctaxon) Like I could determine action=purge on file description page has a visible result in 40% of all usages, not mor...
[02:16:15] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T333332)', diff saved to https://phabricator.wikimedia.org/P46899 and previous config saved to /var/cache/conftool/dbconfig/20230415-022604-ladsgroup.json
[02:26:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[02:26:10] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[02:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[02:26:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T333332)', diff saved to https://phabricator.wikimedia.org/P46900 and previous config saved to /var/cache/conftool/dbconfig/20230415-022627-ladsgroup.json
[02:28:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T333332)', diff saved to https://phabricator.wikimedia.org/P46901 and previous config saved to /var/cache/conftool/dbconfig/20230415-022837-ladsgroup.json
[02:43:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P46902 and previous config saved to /var/cache/conftool/dbconfig/20230415-024344-ladsgroup.json
[02:51:15] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P46903 and previous config saved to /var/cache/conftool/dbconfig/20230415-025850-ladsgroup.json
[03:00:51] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:55] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[03:08:29] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[03:13:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T333332)', diff saved to https://phabricator.wikimedia.org/P46904 and previous config saved to /var/cache/conftool/dbconfig/20230415-031356-ladsgroup.json
[03:14:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[03:14:02] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[03:14:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[03:14:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:14:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[03:14:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T333332)', diff saved to https://phabricator.wikimedia.org/P46905 and previous config saved to /var/cache/conftool/dbconfig/20230415-031437-ladsgroup.json
[03:16:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T333332)', diff saved to https://phabricator.wikimedia.org/P46906 and previous config saved to /var/cache/conftool/dbconfig/20230415-031648-ladsgroup.json
[03:21:33] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:07] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P46907 and previous config saved to /var/cache/conftool/dbconfig/20230415-033154-ladsgroup.json
[03:32:23] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[03:33:59] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[03:47:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P46908 and previous config saved to /var/cache/conftool/dbconfig/20230415-034700-ladsgroup.json
[03:51:47] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T333332)', diff saved to https://phabricator.wikimedia.org/P46909 and previous config saved to /var/cache/conftool/dbconfig/20230415-040207-ladsgroup.json
[04:02:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[04:02:14] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[04:02:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[04:02:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46910 and previous config saved to /var/cache/conftool/dbconfig/20230415-040230-ladsgroup.json
[04:04:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46911 and previous config saved to /var/cache/conftool/dbconfig/20230415-040440-ladsgroup.json
[04:07:23] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[04:08:57] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[04:09:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:14:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:19:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P46912 and previous config saved to /var/cache/conftool/dbconfig/20230415-041947-ladsgroup.json
[04:22:07] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:07] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:29] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[04:34:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P46913 and previous config saved to /var/cache/conftool/dbconfig/20230415-043453-ladsgroup.json
[04:36:05] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[04:37:16] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[04:38:16] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[04:50:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46914 and previous config saved to /var/cache/conftool/dbconfig/20230415-044959-ladsgroup.json
[04:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:50:05] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[04:50:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:50:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46915 and previous config saved to /var/cache/conftool/dbconfig/20230415-045023-ladsgroup.json
[04:52:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46916 and previous config saved to /var/cache/conftool/dbconfig/20230415-045233-ladsgroup.json
[05:07:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P46917 and previous config saved to /var/cache/conftool/dbconfig/20230415-050739-ladsgroup.json
[05:07:57] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[05:09:33] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[05:11:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:16:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:22:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P46918 and previous config saved to /var/cache/conftool/dbconfig/20230415-052246-ladsgroup.json
[05:37:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46919 and previous config saved to /var/cache/conftool/dbconfig/20230415-053752-ladsgroup.json
[05:37:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[05:37:58] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[05:37:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[05:38:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46920 and previous config saved to /var/cache/conftool/dbconfig/20230415-053804-ladsgroup.json
[05:40:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46921 and previous config saved to /var/cache/conftool/dbconfig/20230415-054015-ladsgroup.json
[05:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P46922 and previous config saved to /var/cache/conftool/dbconfig/20230415-055521-ladsgroup.json
[06:10:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P46923 and previous config saved to /var/cache/conftool/dbconfig/20230415-061028-ladsgroup.json
[06:21:41] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:25:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T333332)', diff saved to https://phabricator.wikimedia.org/P46924 and previous config saved to /var/cache/conftool/dbconfig/20230415-062534-ladsgroup.json
[06:25:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[06:25:40] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[06:25:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[06:25:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T333332)', diff saved to https://phabricator.wikimedia.org/P46925 and previous config saved to /var/cache/conftool/dbconfig/20230415-062558-ladsgroup.json
[06:28:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T333332)', diff saved to https://phabricator.wikimedia.org/P46926 and previous config saved to /var/cache/conftool/dbconfig/20230415-062808-ladsgroup.json
[06:31:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:07] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P46927 and previous config saved to /var/cache/conftool/dbconfig/20230415-064314-ladsgroup.json
[06:45:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P46928 and previous config saved to /var/cache/conftool/dbconfig/20230415-065821-ladsgroup.json
[07:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230415T0700)
[07:13:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T333332)', diff saved to https://phabricator.wikimedia.org/P46929 and previous config saved to /var/cache/conftool/dbconfig/20230415-071327-ladsgroup.json
[07:13:33] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[07:22:13] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:11] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:43] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:15] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:03] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:39] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:33] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[08:38:30] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[09:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:27] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[09:40:03] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[09:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:59:17] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:02:29] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:25:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:32:27] <icinga-wm_>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:33:57] <icinga-wm_>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:39:07] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:40:43] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:58:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:10:56] <Kizule>	 Hi, why is downloading speed for dumps.wikimedia.org just about 5 MB/s?
[11:28:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:33:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:51:59] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:17] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[11:59:16] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[12:00:03] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:18] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[12:04:15] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[12:17:16] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334783 (10phaultfinder)
[12:17:18] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[12:20:15] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder)
[12:20:18] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[12:22:16] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[12:22:18] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[12:25:16] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[12:25:18] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[12:40:29] <wikibugs>	 (03PS1) 10Stang: ruwiki: Allow sysop to add/remove confirmed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908945 (https://phabricator.wikimedia.org/T334780)
[12:42:16] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[12:43:16] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[13:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:20:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:44:27] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/
[13:47:41] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/
[13:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:49:19] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) Note that the fix hasn't been deployed yet (we need someone to review and merge it and then I can deploy the f...
[13:51:57] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:37] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:27] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:07] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:59] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:01] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:53] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:15:33] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:21:59] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:31] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:30:53] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:31:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:01] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/
[15:32:29] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:33:37] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/
[15:34:19] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:36:25] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:25] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:39:01] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:46:03] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:53] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:31] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[16:04:31] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[16:21:23] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:33] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[16:25:31] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[16:29:11] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:30:47] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:30:59] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:49] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:38:25] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:42:32] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[16:43:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[16:54:31] <icinga-wm_>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:56:35] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:05:47] <icinga-wm_>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:06:15] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:06:17] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:55] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:42] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10MusikAnimal) Woohoo! Thanks so much to Tgr and everyone...
[17:33:13] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:33:33] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:35:07] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:36:25] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:42:53] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:44:29] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:51:05] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:09] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:15:13] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:06:37] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:16:17] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:37:07] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:09] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:37] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:19] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:16] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[20:09:14] <wikibugs>	 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder)
[20:27:16] <wikibugs>	 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder)
[20:30:15] <wikibugs>	 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder)
[20:36:39] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:15] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:17] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[20:48:16] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[20:51:05] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:15:17] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/
[21:16:53] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/
[21:23:13] <icinga-wm_>	 PROBLEM - Disk space on urldownloader1001 is CRITICAL: DISK CRITICAL - free space: / 328 MB (3% inode=89%): /tmp 328 MB (3% inode=89%): /var/tmp 328 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops
[21:36:03] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:39] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:51:43] <icinga-wm_>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:37] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/
[22:16:13] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/
[22:20:59] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:30:39] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:03] <wikibugs>	 (03PS1) 10Superpes15: [trwikiquote] Add a HD logo for Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732)
[23:05:57] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:23] <wikibugs>	 (03PS1) 10Superpes15: [ruwiki] Add 'sboverride' right to the bot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908973 (https://phabricator.wikimedia.org/T334344)
[23:21:00] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908973 (https://phabricator.wikimedia.org/T334344) (owner: 10Superpes15)
[23:24:58] <wikibugs>	 (03PS2) 10Superpes15: [trwikiquote] Add a HD logo for Vector legacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732)