[00:00:42] <icinga-wm>	 RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:12] <icinga-wm>	 PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:32] <wikibugs>	 (PS1) Jbond: WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992
[00:11:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992 (owner: Jbond)
[00:22:56] <wikibugs>	 (CR) Jbond: puppet_compiler: add pcc facts processor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[00:31:30] <wikibugs>	 (PS1) Ladsgroup: flaggedrevs: Fix idwiki's autoreview config [mediawiki-config] - https://gerrit.wikimedia.org/r/745995 (https://phabricator.wikimedia.org/T288404)
[00:38:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={sidekiq,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:58] <wikibugs>	 (PS2) Krinkle: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: Jdlrobson)
[00:53:03] <wikibugs>	 (CR) Krinkle: [C: +1] Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: Jdlrobson)
[01:10:55] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Delete "releng" mailman account - https://phabricator.wikimedia.org/T294270 (Ladsgroup) Open→Resolved a:Ladsgroup It was quite a mess. I finally gave up trying to at suggested way: https://docs.mailman3.org/projects/mailman/en/latest/src/mailm...
[01:39:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (RKemper) >>! In T294152#7563822, @Jclark-ctr wrote: > @RKemper  rack A6 is not 10g rack b4 has no space.  Are there any other requi...
[02:21:10] <wikibugs>	 SRE, Discovery, Wikidata, Wikidata-Query-Service, wdwb-tech: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (RKemper)
[02:22:05] <wikibugs>	 SRE, Wikidata, Wikidata-Query-Service, wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (RKemper)
[02:23:14] <icinga-wm>	 PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:16] <icinga-wm>	 RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:22] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:03:32] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 93.19% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[05:57:48] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:56] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211212T0800)
[09:45:39] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) The message probably indicates that mmap() returned NULL when PHP tried to allocate memory. I don't...
[12:36:10] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:49:50] <icinga-wm>	 RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:50:38] <icinga-wm>	 PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:52:52] <icinga-wm>	 RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:15:44] <icinga-wm>	 PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:58] <wikibugs>	 SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Yann) One more  `Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 49490611 Error: 503, Backend fetch failed at Sun, 12 Dec 2021 13:15:28 GMT` for https://www.archive.org/d...
[13:21:42] <icinga-wm>	 PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:25:58] <icinga-wm>	 RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:32:18] <icinga-wm>	 PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:36:36] <icinga-wm>	 RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:37:10] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:56:20] <icinga-wm>	 PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:34] <icinga-wm>	 RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:05:10] <icinga-wm>	 PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:47] <godog>	 sigh graphite1004, I'll reboot it
[14:08:09] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[14:08:11] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host graphite1004.eqiad.wmnet
[14:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:19] <wikibugs>	 SRE, Graphite, Patch-For-Review, User-fgiunchedi, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (ops-monitoring-bot) Host rebooted by filippo@cumin1001 with reason: None
[14:09:24] <icinga-wm>	 RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:16:54] <icinga-wm>	 RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:23:44] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on graphite1004.eqiad.wmnet with reason: powercycle
[14:23:45] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on graphite1004.eqiad.wmnet with reason: powercycle
[14:23:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:26] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[14:30:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:35] <wikibugs>	 SRE, Graphite, Patch-For-Review, User-fgiunchedi, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (ops-monitoring-bot) Host rebooted by filippo@cumin1001 with reason: revert back to linux 5.10.0-9 since graphite2003 has been stable so far
[14:35:55] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[14:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:50] <icinga-wm>	 PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:12:55] <wikibugs>	 SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Yann) Several times the same error with https://archive.org/download/awakening_librivox/awakening_02_chopin.mp3 48.42 MB  `Request from 92.145.93.28 via cp3054 cp3054, Varnish XID...
[17:10:54] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:13:58] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:12:00] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:14] <icinga-wm>	 RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:00:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:16] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:22:32] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:24:44] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:34:48] <icinga-wm>	 PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2021-12-09 21:01:31 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:03:02] <icinga-wm>	 PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:20:00] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:04:12] <icinga-wm>	 RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:41] <wikibugs>	 (PS2) Jbond: WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992
[23:38:30] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:42:56] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:49:20] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook