[00:00:42] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:12] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:32] (PS1) Jbond: WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992
[00:11:14] (CR) jerkins-bot: [V: -1] WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992 (owner: Jbond)
[00:22:56] (CR) Jbond: puppet_compiler: add pcc facts processor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[00:31:30] (PS1) Ladsgroup: flaggedrevs: Fix idwiki's autoreview config [mediawiki-config] - https://gerrit.wikimedia.org/r/745995 (https://phabricator.wikimedia.org/T288404)
[00:38:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={sidekiq,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:58] (PS2) Krinkle: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: Jdlrobson)
[00:53:03] (CR) Krinkle: [C: +1] Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: Jdlrobson)
[01:10:55] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Delete "releng" mailman account - https://phabricator.wikimedia.org/T294270 (Ladsgroup) Open→Resolved a:Ladsgroup It was quite a mess. I finally gave up trying to at suggested way: https://docs.mailman3.org/projects/mailman/en/latest/src/mailm...
[01:39:41] SRE, ops-eqiad, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (RKemper) >>! In T294152#7563822, @Jclark-ctr wrote: > @RKemper rack A6 is not 10g rack b4 has no space. Are there any other requi...
[02:21:10] SRE, Discovery, Wikidata, Wikidata-Query-Service, wdwb-tech: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (RKemper)
[02:22:05] SRE, Wikidata, Wikidata-Query-Service, wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (RKemper)
[02:23:14] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:16] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:22] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:03:32] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 93.19% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[05:57:48] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:56] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211212T0800)
[09:45:39] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) The message probably indicates that mmap() returned NULL when PHP tried to allocate memory. I don't...
[12:36:10] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:49:50] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:50:38] PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:52:52] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:15:44] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:58] SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Yann) One more `Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 49490611 Error: 503, Backend fetch failed at Sun, 12 Dec 2021 13:15:28 GMT` for https://www.archive.org/d...
[13:21:42] PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:25:58] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:32:18] PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:36:36] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:37:10] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:56:20] PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:34] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:05:10] PROBLEM - SSH on graphite1004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:47] sigh graphite1004, I'll reboot it
[14:08:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[14:08:11] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host graphite1004.eqiad.wmnet
[14:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:19] SRE, Graphite, Patch-For-Review, User-fgiunchedi, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (ops-monitoring-bot) Host rebooted by filippo@cumin1001 with reason: None
[14:09:24] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:16:54] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:23:44] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on graphite1004.eqiad.wmnet with reason: powercycle
[14:23:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on graphite1004.eqiad.wmnet with reason: powercycle
[14:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:26] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[14:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:35] SRE, Graphite, Patch-For-Review, User-fgiunchedi, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (ops-monitoring-bot) Host rebooted by filippo@cumin1001 with reason: revert back to linux 5.10.0-9 since graphite2003 has been stable so far
[14:35:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[14:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:50] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:12:55] SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Yann) Several times the same error with https://archive.org/download/awakening_librivox/awakening_02_chopin.mp3 48.42 MB `Request from 92.145.93.28 via cp3054 cp3054, Varnish XID...
[17:10:54] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:13:58] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:12:00] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:14] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:00:48] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:16] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:22:32] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:24:44] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:34:48] PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2021-12-09 21:01:31 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:03:02] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:20:00] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:04:12] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:41] (PS2) Jbond: WIP: move towards asyncio [puppet] - https://gerrit.wikimedia.org/r/745992
[23:38:30] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:42:56] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:49:20] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook