[00:00:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:05:39] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:39] PROBLEM - Check systemd state on ganeti2009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:27] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:27] RECOVERY - Check systemd state on ganeti2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:20] SRE, CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (sgrabarczuk)
[03:12:51] PROBLEM - Check systemd state on doh5002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:49] RECOVERY - Check systemd state on doh5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:56:33] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:55] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:43] (CR) VolkerE: [C: -1] "Some notes inside." [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[05:57:21] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:01:20] (PS7) Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877)
[06:02:48] (CR) Juan90264: Adding and use wordmark in azwiki (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[06:15:52] (CR) Juan90264: "I hope you review this change. I confess I'm already getting tired of these changes, I have several other changes that many reviewers simp" [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[07:23:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:30:25] PROBLEM - Check systemd state on ping3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:17] SRE, Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (Legoktm)
[10:57:21] RECOVERY - Check systemd state on ping3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:15] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:45:09] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:07:27] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[15:09:23] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[16:09:17] SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Reedy)
[16:09:28] SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Reedy)
[16:09:54] SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Reedy) Open→Stalled Marking stalled until usages inside MW are removed.
[17:10:47] PROBLEM - MegaRAID on an-worker1096 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:10:58] ACKNOWLEDGEMENT - MegaRAID on an-worker1096 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T290805 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:11:03] SRE, ops-eqiad: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (ops-monitoring-bot)
[17:17:14] PROBLEM - Disk space on maps2006 is CRITICAL: DISK CRITICAL - free space: / 2500 MB (3% inode=98%): /tmp 2500 MB (3% inode=98%): /var/tmp 2500 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2006&var-datasource=codfw+prometheus/ops
[18:28:34] (PS1) Urbanecm: Revert "test: Add electcomm and electionadmin groups" [mediawiki-config] - https://gerrit.wikimedia.org/r/720265 (https://phabricator.wikimedia.org/T290808)
[18:28:39] (PS2) Urbanecm: Revert "test: Add electcomm and electionadmin groups" [mediawiki-config] - https://gerrit.wikimedia.org/r/720265 (https://phabricator.wikimedia.org/T290808)
[18:28:45] (CR) Urbanecm: [C: +2] "emergency" [mediawiki-config] - https://gerrit.wikimedia.org/r/720265 (https://phabricator.wikimedia.org/T290808) (owner: Urbanecm)
[18:30:03] (Merged) jenkins-bot: Revert "test: Add electcomm and electionadmin groups" [mediawiki-config] - https://gerrit.wikimedia.org/r/720265 (https://phabricator.wikimedia.org/T290808) (owner: Urbanecm)
[18:31:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 908bbf35235ea4129795dfbf4c0e646440152e18: Revert "test: Add electcomm and electionadmin groups" (T290808) (duration: 00m 58s)
[18:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:59] !log [urbanecm@mwmaint2002 ~]$ mwscript emptyUserGroup.php --wiki=testwiki {electionadmin,electcomm} # T290808
[18:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:32] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:50:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:52:14] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:52:18] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:58:08] (PS1) Urbanecm: testwiki: Fully remove securepoll-related groups [mediawiki-config] - https://gerrit.wikimedia.org/r/720454 (https://phabricator.wikimedia.org/T290808)
[18:58:22] (CR) Urbanecm: [C: +2] "emergency" [mediawiki-config] - https://gerrit.wikimedia.org/r/720454 (https://phabricator.wikimedia.org/T290808) (owner: Urbanecm)
[18:59:16] (Merged) jenkins-bot: testwiki: Fully remove securepoll-related groups [mediawiki-config] - https://gerrit.wikimedia.org/r/720454 (https://phabricator.wikimedia.org/T290808) (owner: Urbanecm)
[19:02:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 27814b8eaacb5ba2fee1b6167a36ea14356a1ecf: testwiki: Fully remove securepoll-related groups (T290808) (duration: 00m 57s)
[19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:49] (PS1) Urbanecm: Add throttle rule for Czech wiki course [mediawiki-config] - https://gerrit.wikimedia.org/r/720458 (https://phabricator.wikimedia.org/T290809)
[22:39:34] PROBLEM - Check systemd state on ganeti3003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:04] SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (Peachey88)
[23:05:34] RECOVERY - Check systemd state on ganeti3003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state