[00:02:15] (CR) Legoktm: [C: -1] cirrus: systemd timer for readahead script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[00:07:52] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:14] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:27:06] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:26] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:32:47] (Traffic bill over quota) firing: (4) Traffic bill over quota - https://alerts.wikimedia.org
[00:33:50] (PS1) H.krishna123: [WIP] api_db: Add code to enable database connection Added code to connect to an SQL database, added skeleton for unit tests, cleaned up main.py file and added a singleton class to keep database configuration the same throughout the program. Added DB query functionality for the readiness probe [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[00:34:41] (PS2) H.krishna123: [WIP] api_db: Add code to enable database connection [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[00:34:46] (PS12) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053)
[00:36:07] (CR) Ryan Kemper: cirrus: systemd timer for readahead script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[00:37:47] (Traffic bill over quota) firing: (7) Traffic bill over quota - https://alerts.wikimedia.org
[00:52:47] (Traffic bill over quota) firing: (7) Traffic bill over quota - https://alerts.wikimedia.org
[00:57:47] (Traffic bill over quota) resolved: (3) Traffic bill over quota - https://alerts.wikimedia.org
[01:16:29] !log uploaded elasticsearch-madvise 0.1 to apt.wm.o (T264053)
[01:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:39] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053
[01:26:59] (PS13) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053)
[01:28:35] (CR) Ryan Kemper: cirrus: systemd timer for readahead script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:36:33] (CR) Ryan Kemper: [C: +2] cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:39:07] (PS1) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702785
[01:41:16] (CR) Ryan Kemper: [C: +2] cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702785 (owner: Ryan Kemper)
[01:46:44] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:48] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:02] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:04] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:12] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:14] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:14] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:18] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
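[Editor's note: the change merged at 01:36 above deploys a systemd timer that runs a readahead script on the Elasticsearch hosts, and the unit failing in the alerts that follow is elasticsearch-disable-readahead.service. For context, a oneshot service plus timer pair of that general shape could look like the sketch below; the unit names come from the alerts, but the ExecStart path and the schedule are assumptions, not taken from the actual puppet change. Note that any failed unit, including a oneshot like this, is enough to flip `systemctl is-system-running` to "degraded", which is what the check_systemd_state alerts report.]

```ini
# elasticsearch-disable-readahead.service -- hypothetical sketch, not the deployed unit
[Unit]
Description=Disable readahead for Elasticsearch data disks

[Service]
Type=oneshot
# Path is an assumption; per the 01:16 !log entry the real script ships
# in the elasticsearch-madvise package uploaded to apt.wm.o
ExecStart=/usr/bin/elasticsearch-madvise
```

```ini
# elasticsearch-disable-readahead.timer -- hypothetical sketch
[Unit]
Description=Periodically re-disable readahead for Elasticsearch

[Timer]
# Schedule is an assumption
OnCalendar=hourly

[Install]
WantedBy=timers.target
```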
[01:48:22] PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:28] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:28] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:30] PROBLEM - Check systemd state on elastic2039 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:32] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:38] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:38] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:40] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:40] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:42] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:42] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:46] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:48] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:48] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:50] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:52] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:54] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:04] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:08] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:08] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:14] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:14] PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:16] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:16] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:22] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:24] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:26] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:30] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:30] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:34] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04713 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[01:49:38] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:46] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:48] PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:56] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:02] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:04] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:14] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:21] (PS1) Ryan Kemper: cirrus: fix broken module path [puppet] - https://gerrit.wikimedia.org/r/702788 (https://phabricator.wikimedia.org/T264053)
[01:50:22] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:26] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:44] ^ Fixing this right now
[01:51:01] Sorry for the wall of text
[01:51:52] (CR) Ryan Kemper: [C: +2] cirrus: fix broken module path [puppet] - https://gerrit.wikimedia.org/r/702788 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:59:06] SRE, Commons, MediaWiki-File-management, SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (Aklapp...
[02:02:12] (PS1) Ryan Kemper: Revert 1c99db9965361cdf95f042bb2401e86733a31393 [puppet] - https://gerrit.wikimedia.org/r/702790 (https://phabricator.wikimedia.org/T264053)
[02:02:43] `elasticsearch-madvise` seems to have failed to install. I think it's something simple, but getting my changes reverted first: https://gerrit.wikimedia.org/r/c/operations/puppet/+/702790/
[02:05:23] (CR) Ryan Kemper: [C: +2] Revert 1c99db9965361cdf95f042bb2401e86733a31393 [puppet] - https://gerrit.wikimedia.org/r/702790 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[02:09:02] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:42] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:44] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:46] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:46] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:00] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:12] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:16] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:16] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:20] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:20] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:24] RECOVERY - Check systemd state on elastic2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:26] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:26] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:30] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:30] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:34] RECOVERY - Check systemd state on elastic2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:34] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:40] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:40] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:50] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:50] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:52] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:56] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:58] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:00] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:02] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:02] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:08] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:12] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:16] RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:16] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:18] RECOVERY - Check systemd state on elastic1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:18] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:24] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:26] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:28] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:30] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:32] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:54] * ryankemper forgot that the host I ran puppet to test the changes on still had the manually installed dpkg thus didn't catch the mistake before running puppet across the fleet...classic [02:13:13] * ryankemper shakes fist at the general concept of state [02:13:14] alright we're back to a good state now [02:14:30] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001149 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:16:30] (03PS1) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) [02:17:37] 10SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (10AntiCompositeNumber) p:05High→03Medium >>! In T285875#7189001, @Legoktm wrote: > T226318#5282215 suggests... 
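The "cirrus: systemd timer for readahead script" puppet change being iterated on above packages a periodic maintenance script as a systemd service/timer pair. A hypothetical sketch of what such a unit pair looks like in general (unit names, script path, and schedule are illustrative assumptions, not taken from the actual patch):

```ini
# elasticsearch-readahead.service (illustrative name)
[Unit]
Description=Tune readahead behaviour for Elasticsearch index files

[Service]
Type=oneshot
ExecStart=/usr/local/bin/elasticsearch-readahead

# elasticsearch-readahead.timer (illustrative name)
[Unit]
Description=Run the Elasticsearch readahead script periodically

[Timer]
# Schedule is an assumption; RandomizedDelaySec spreads runs across a fleet.
OnCalendar=hourly
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```

A timer like this is enabled with `systemctl enable --now elasticsearch-readahead.timer`; the matching service unit stays disabled and is only triggered by the timer.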
[02:20:17] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [02:35:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-mediawiki-private.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:13] !log uploaded elasticsearch-madvise_0.1~deb9u1_amd64.changes to stretch-wikimedia on apt1001 [03:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:27] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo disable-puppet "verify new deb package works - T264053"'` [03:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:34] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [03:05:38] (03CR) 10Ryan Kemper: [C: 03+2] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [03:06:16] !log T264053 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/702791; will run puppet on single host [03:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:04] !log T264053 `ryankemper@elastic2054:~$ sudo run-puppet-agent --force` [03:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:57] !log T264053 `Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-madvise' returned 100: Reading package lists...` grr [03:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:55] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo apt update'` fixed the issue [03:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:03] T264053: Unsustainable 
increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [03:14:57] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo run-puppet-agent --force'` [03:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:31] 10SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (10AntiCompositeNumber) 05Open→03Declined [04:29:32] (03PS1) 10Marostegui: db1129: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/702793 [04:30:52] (03CR) 10Marostegui: [C: 03+2] db1129: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/702793 (owner: 10Marostegui) [05:11:54] 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10Aklapper) a:05jgleeson→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sen... [05:12:16] 10SRE: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Ple... [05:14:32] 10SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (10Aklapper) a:05akosiaris→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign th... 
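The elasticsearch-madvise package uploaded and rolled out above exists to curb readahead-driven disk IO (T264053). As a rough illustration of the underlying idea only (the real tool's mechanism and interface are not shown in the log, and the function name here is made up), a file can be flagged for random access with `posix_fadvise`, which tells the kernel to skip aggressive readahead:

```python
import os

def advise_random(paths):
    """Hint the kernel that these files will be read with a random access
    pattern (POSIX_FADV_RANDOM), which suppresses readahead for them.
    Returns the number of files advised. Linux/Unix only."""
    advised = 0
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            # offset=0, length=0 applies the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        finally:
            os.close(fd)
        advised += 1
    return advised
```

The advice is per open file description and purely a hint, which is why a fleet-wide package (rather than a one-off command) is needed to apply it consistently.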
[05:15:07] 10SRE: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`).... [05:15:17] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10Aklapper) a:05RobH→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee o... [05:17:12] 10SRE, 10DNS, 10Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (10Aklapper) a:05mark→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please a... [05:17:49] 10SRE: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10Aklapper) a:05Ottomata→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign this task to... [05:18:09] 10SRE, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10Aklapper) a:05herron→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (s... [05:18:42] 10SRE, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Aklapper) a:05CDanis→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Plea... 
[05:19:12] 10SRE, 10Scap, 10serviceops, 10Goal, 10User-jijiki: SRE FY2019 Q3:TEC6: First steps towards Canary Deployments - https://phabricator.wikimedia.org/T213156 (10Aklapper) a:05jijiki→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails s... [05:20:49] 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10Aklapper) a:05akosiaris→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years... [05:21:18] 10SRE, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Aklapper) a:05EBernhardson→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May... [05:22:15] 10SRE, 10Traffic: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and... [05:22:38] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Aklapper) a:05Vgutierrez→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee... [05:22:46] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-Needs-Improvement: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230 (10Aklapper) a:05herron→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to... 
[05:23:24] 10SRE, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to... [05:24:50] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10Aklapper) a:05CDanis→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign this t... [05:47:26] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) @papaul thank you! [06:08:58] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) It appears that the host gets stuck at {F34535299}, probably something got messed up with the boot order [06:09:12] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) 05Resolved→03Open [06:19:07] (03CR) 10Jcrespo: "See my comments below." (033 comments) [software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: 10H.krishna123) [06:42:26] (03CR) 10Jcrespo: "I strongly suggest you run a python code linter when developing- it will save you many headaches early on, and most likely we will want t" [software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: 10H.krishna123) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210702T0700) [07:10:12] (03PS6) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:10:39] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:12:11] 10SRE, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Automated restarts are in place for most services and everything else is ongoing fine-tuning and add... [07:25:55] !log installing openjdk-8-dbg on wdqs1013 [07:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:26] (03PS7) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:27:54] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:29:10] (03PS8) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:38:29] (03PS9) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:44:22] (03CR) 10Jcrespo: "@jbond when you have time (not a priority) please review my wiki edits at https://wikitech.wikimedia.org/w/index.php?title=Puppet%2FWmflib" [puppet] - 10https://gerrit.wikimedia.org/r/694332 
(https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:46:29] thcipriani greg-g brennen: help! I'd like to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702780 -- context is T285996 [07:46:30] T285996: [regression-wmf12] new accounts do not get GrowthExperiments features - https://phabricator.wikimedia.org/T285996 [07:49:30] (03CR) 10Jcrespo: "I think I addressed all comments, this is what the new (single) rule does:" [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:49:59] (are the only people who can approve an emergency deploy based in Americas timezones?) [07:53:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1267.eqiad.wmnet [07:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1268.eqiad.wmnet [07:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:09] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1267.eqiad.wmnet [07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:16] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1268.eqiad.wmnet [07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:59] (03CR) 10JMeybohm: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [07:56:23] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:03:45] !log installing ipmitool security updates [08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
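The conftool actions logged above depool mw1267/mw1268 in two steps: first `pooled=no` (stop sending traffic), then `pooled=inactive` (drop the host from active config ahead of decommissioning). A hedged sketch of that ordering, with the `confctl` argument syntax inferred from the log entries and the command runner made injectable so the sequence can be dry-run:

```python
import subprocess

def depool_host(host, run=subprocess.run):
    """Depool `host` in two steps, mirroring the logged conftool actions:
    pooled=no stops traffic first; pooled=inactive then removes the host
    from the active configuration (e.g. before decommissioning)."""
    for action in ("set/pooled=no", "set/pooled=inactive"):
        # confctl invocation shape is an assumption inferred from the log.
        run(["confctl", "select", f"name={host}", action], check=True)
```

For example, `depool_host("mw1267.eqiad.wmnet")` would reproduce the 07:53-07:54 sequence for that host.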
[08:09:25] (03PS1) 10Dzahn: site/install/conftool: decom mw1267, mw1268 [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) [08:11:32] (03PS1) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) [08:12:02] (03CR) 10Jelto: "please take a look" [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:12:15] PROBLEM - mediawiki-installation DSH group on mw1267 is CRITICAL: Host mw1267 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:14:31] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [08:14:54] (03Abandoned) 10Muehlenhoff: conf: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [08:15:55] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1267 is CRITICAL: Host mw1267 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:17:00] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) Status update: the error isn't new (as in, it didn't start appearing on Jun 27th) and thanos-sidecar also sometimes experiences the error. We have sporadic errors dat... [08:18:18] (03CR) 10Jelto: [C: 03+1] "lgtm +1" [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [08:22:50] (03CR) 10Dzahn: [C: 03+1] "looks good to me, but we should add icinga downtimes, then merge.. 
and then they need to be added to conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:23:36] (03PS4) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [08:24:00] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:40] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:41] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:14] (03CR) 10Dzahn: [C: 03+2] site/install/conftool: decom mw1267, mw1268 [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [08:26:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1267.eqiad.wmnet [08:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:31] (03PS2) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 
(https://phabricator.wikimedia.org/T279309) [08:31:46] (03CR) 10Dzahn: [C: 03+1] site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:32:53] (03PS1) 10Kosta Harlan: Fix handling of geEnabled flag [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) [08:36:54] (03PS3) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) [08:37:50] (03CR) 10Jelto: [C: 03+2] site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:45:33] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) Also WRT the upstream bug https://bugs.launchpad.net/swift/+bug/1636663, it could be related but I couldn't get positive confirmation: there are connection timeouts (... 
[08:46:09] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (analytics) [cookbooks] - 10https://gerrit.wikimedia.org/r/702883 [08:46:11] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (search) [cookbooks] - 10https://gerrit.wikimedia.org/r/702884 [08:46:13] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (various) [cookbooks] - 10https://gerrit.wikimedia.org/r/702885 [08:46:15] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (generic) [cookbooks] - 10https://gerrit.wikimedia.org/r/702886 [08:49:37] PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:50:46] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:51:33] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:54:56] !log installing golang-docker-credential-helpers security updates [08:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:09] !log deploying emergency backport: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702808 [09:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:37] !log installing node-hosted-git-info security updates [09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:09] (03CR) 10David Caro: [C: 03+2] ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:04:17] (03CR) 10Gergő Tisza: [C: 03+2] "Emergency backport per T285996#7192814" [extensions/GrowthExperiments] 
(wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) (owner: 10Kosta Harlan) [09:04:20] !log decom'ing mw1267 [09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:12] (03PS1) 10Jelto: add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) [09:06:41] (03CR) 10Dzahn: [V: 03+1 C: 03+1] add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [09:07:04] (03CR) 10Jelto: [V: 03+2 C: 03+2] add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [09:08:49] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [09:14:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1267.eqiad.wmnet [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:01] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1267.eqiad.wmnet` - m... 
[09:17:38] (03PS5) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [09:19:16] !log restart blazegraph on wdqs1013 [09:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:44] (03PS1) 10David Caro: Revert "ceph.keyring: ensure that the bootstrap dir exists" [puppet] - 10https://gerrit.wikimedia.org/r/702892 [09:21:15] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [09:21:43] (03CR) 10David Caro: [C: 03+2] "This is breaking current runs on ceph machines." [puppet] - 10https://gerrit.wikimedia.org/r/702892 (owner: 10David Caro) [09:24:00] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0115 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:24:15] !log test thanos 0.21.1 locally on thanos-fe2001 and depool the host - T285835 [09:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:23] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [09:24:36] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.514e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:25:40] (03Merged) 10jenkins-bot: Fix handling of geEnabled flag [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) (owner: 10Kosta Harlan) [09:28:20] kostajh: it's on mwdebug2001 [09:29:04] tgr: having a look [09:30:07] tgr: I created an account on cswiki, and got welcome survey + homepage [09:32:16] 10SRE, 10User-MoritzMuehlenhoff: Sort out which RAID packages are still needed - 
https://phabricator.wikimedia.org/T216043 (10MoritzMuehlenhoff) [09:32:18] tgr: I also confirmed that geEnabled=1 & campaign redirects straight to homepage, while geEnabled=0 switches features off [09:32:22] so, lgtm [09:32:30] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.002844 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:32:40] 10SRE, 10User-MoritzMuehlenhoff: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10MoritzMuehlenhoff) [09:34:15] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (10MoritzMuehlenhoff) [09:34:48] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/30095/" [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:36:02] yeah, same here. I also got a WS-but-no-homepage with no extra flags so the randomization seems to work. [09:36:09] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Ladsgroup) The main person working on this is Kunal and he was busy with deploying shellbox for Score l... 
[09:36:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1268.eqiad.wmnet [09:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:38] (03PS1) 10Matthias Mullie: Separate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) [09:37:45] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/GrowthExperiments/includes/HomepageHooks.php: Backport: [[gerrit:702808|Fix handling of geEnabled flag (T285996)]] (duration: 00m 57s) [09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] T285996: [regression-wmf12] new accounts do not get GrowthExperiments features - https://phabricator.wikimedia.org/T285996 [09:40:18] help! I too have an emergency deployment request (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/702810 - An UploadWizard step is significantly unusable, and I only now realized there will be no deployments next week...) [09:41:23] matthiasmullie: Happy to deploy if there's an SRE around to confirm it's OK to deploy now. [09:41:57] +1, I was also thinking that might be worth a backport [09:42:29] (03PS1) 10Ema: varnish: use 403 instead of 429 where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) [09:43:17] All emergency contacts this week are on or near the West Coast, sadly. [09:44:38] Ping to thcipriani greg-g brennen dduvall for an emergency deploy request for T285579 for matthiasmullie. 
[09:44:39] T285579: "Add data" step of Upload Wizard broken - https://phabricator.wikimedia.org/T285579 [09:48:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1268.eqiad.wmnet [09:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:45] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1268.eqiad.wmnet` - m... [09:49:35] James_F, Lucas_WMDE - earlier on we had another emergency deployment, if the fix's scope is contained/not-too-broad I'd say that we can proceed. If possible let's do it now so more people are around (not later on in the afternoon) [09:50:36] 10SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (10MoritzMuehlenhoff) >>! In T284772#7187773, @dancy wrote: > Most of the time releases1002 isn't doing much, just waiting for jobs to be triggered. One of the jobs (mediawiki-config-pipeline-wmf-publish) curre... [09:51:34] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) This is done, isn't it? The performance issues are being mitigated by migrating to nginx light I think (someone needs to double check) [09:52:36] Fine, let's do it. [09:52:47] yay! 
[09:52:48] (03CR) 10Jforrester: [C: 03+2] Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [09:52:56] (03PS1) 10David Caro: ceph.keyring: make sure the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) [09:53:11] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005754 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:54:53] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) TLSv1.3 is up & running, performance issues are being mitigated by replacing ats-tls with envoy or haproxy in the short term :) [09:54:55] (03CR) 10David Caro: [C: 04-1] ceph.keyring: make sure the bootstrap dir exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:55:12] (03PS2) 10David Caro: ceph.keyring: make sure the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) [09:55:14] (03CR) 10David Caro: ceph.keyring: make sure the bootstrap dir exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:55:18] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) >>! In T285835#7193102, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/BIyIZnoBa_6PSCT9b6Oy} [20... [09:58:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I don't have the context on the mentioned dependency cycle, but the patch LGTM anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:59:27] matthiasmullie: Bah, the selenium run failed. [09:59:41] "Failed at the Wikibase@0.1.0 selenium-test script.", what a surprise. [10:00:24] :) [10:00:42] (03CR) 10David Caro: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:01:32] (03CR) 10jerkins-bot: [V: 04-1] Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:02:48] (03CR) 10Jforrester: [C: 03+2] "…" [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:05:00] Amir1: I'm literally deploying it now. :-) [10:05:18] James_F: ugh, I should have checked IRC before [10:05:20] (03PS1) 10Elukey: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) [10:05:23] * James_F grins. [10:05:24] awesome! [10:05:33] Thanks [10:05:45] * James_F grumbles about the new "miscweb" service destroying tab-completion of `/srv/m`. [10:06:10] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:06:32] Clearly we should have called it /srv/nanosites or whatever. [10:08:22] now I have to find out what miscweb is ;) [10:08:42] static-bugzilla, apparently? [10:09:27] wdqs gui [10:09:27] So far just that, yes. [10:09:36] Eventually a few more little things. [10:09:41] transparency reports, etc. 
[10:09:53] https://wikitech.wikimedia.org/wiki/Miscweb1002 [10:09:53] (03PS2) 10Elukey: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) [10:10:05] so a container to host static websites right? [10:10:11] https://wikitech.wikimedia.org/wiki/Microsites [10:10:12] Calling them "microsites" would be more industry-standard, but wouldn't help with tab completion. [10:10:23] most of those aren't in containers yet afaik [10:10:36] it's a ganeti VM [10:10:41] with apache [10:10:55] like people1002 basically [10:11:13] Yeah. [10:11:30] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [10:16:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30099/console" [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:18:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You should use require/include in the two dependent classes and then you can drop the if defined." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:19:36] <_joe_> mutante is working on moving microsites to k8s IIRC [10:24:37] PROBLEM - Memcached on mw1414 is CRITICAL: connect to address 10.64.0.160 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:25:07] (03Merged) 10jenkins-bot: Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:25:12] Aha. Finally. [10:26:09] matthiasmullie: Should be live on mwdebug2001. [10:26:15] Checking... [10:31:56] James_F: seems to work, can proceed! [10:32:24] Going. 
[10:33:11] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/WikibaseMediaInfo: UploadWizard/WikibaseMediaInfo fix 3fd2873 for [[phab:T285579|T285579]] (duration: 00m 59s) [10:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:20] T285579: "Add data" step of Upload Wizard broken - https://phabricator.wikimedia.org/T285579 [10:33:51] James_F: thanks a lot! [10:34:07] Any time. [10:34:13] No spike in errors that I see. [10:34:16] Calling this a success. [10:34:51] nice :) [10:35:40] (03PS1) 10Effie Mouzeli: network::constants: add kubepods network constant [puppet] - 10https://gerrit.wikimedia.org/r/702910 [10:49:56] (03PS3) 10Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:49:58] (03PS1) 10Giuseppe Lavagetto: kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702912 [10:50:28] (03CR) 10Muehlenhoff: network::constants: add kubepods network constant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [10:51:42] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702912 (owner: 10Giuseppe Lavagetto) [10:51:47] PROBLEM - mediawiki-installation DSH group on mw1414 is CRITICAL: Host mw1414 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:56:10] (03CR) 10Effie Mouzeli: network::constants: add kubepods network constant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [10:56:52] (03PS2) 10Effie Mouzeli: network::constants: add kubepods network constant [puppet] - 10https://gerrit.wikimedia.org/r/702910 [10:59:24] <_joe_> effie: do we really need to add a global def there? 
[11:00:54] <_joe_> can't we just join all the ones we need in the firewall rules definition? [11:01:13] <_joe_> CODFW_PRIVATE_PRIVATE1_KUBEPODS_CODFW and so on [11:01:21] _joe_: my thought was that it'd be needed more than once as we are adding new services [11:01:49] not just for maps hosts I mean [11:01:52] <_joe_> heh I was thinking it's better to explicitly list the clusters you give access to [11:02:05] <_joe_> anyhow, bbl sorry [11:02:19] moritzm: what is your opinion? [11:03:17] (03PS3) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [11:03:19] (03PS2) 10Hnowlan: maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) [11:09:27] effie: no strong opinion, but I think Alex said in the past that we should rather reduce network constants (except the major ones like production_networks) than add new ones [11:10:38] alright, what I generally want is that next time someone needs to add those networks, to do it easily [11:11:13] what would be the alternative?
those subnets are defined in modules/network/data/data.yaml [11:11:31] I think we would like to avoid defining them again [11:11:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [11:14:57] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:14:59] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:05] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] ACKNOWLEDGEMENT - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T285603 [11:22:00] (03PS1) 10Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 [11:24:18] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/30100/maps1009.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [11:28:34] !log powercycling mw2380, trying to make it boot [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
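The trade-off debated above (adding one shared `kubepods` network constant vs. joining the per-cluster subnet lists directly in each firewall rule, as _joe_ suggests) can be sketched in plain Python. The subnet values below are made up for illustration only; the real ones live in modules/network/data/data.yaml in operations/puppet, and `join_subnets` is a hypothetical helper, not actual puppet code:

```python
import ipaddress

# Hypothetical per-cluster pod subnets (illustrative values only; the real
# data lives in modules/network/data/data.yaml and should not be redefined).
KUBEPODS = {
    "eqiad": ["10.67.128.0/18"],
    "codfw": ["10.194.128.0/18"],
}

def join_subnets(clusters):
    """Flatten the chosen clusters' subnet lists into one sorted list,
    the way a firewall rule's source range could be assembled on the fly
    from existing per-cluster data instead of a new global constant."""
    nets = [ipaddress.ip_network(n) for c in clusters for n in KUBEPODS[c]]
    return sorted(str(n) for n in nets)

print(join_subnets(["eqiad", "codfw"]))
```

The upside of joining at the rule site is that each rule explicitly lists the clusters it grants access to; the upside of a shared constant is that the next service needing the same access gets it in one place.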
[11:29:29] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/30101/" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [11:31:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [11:35:55] (03PS1) 10Jelto: add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) [11:39:22] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:40:14] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:40:18] (03CR) 10Dzahn: [C: 03+1] add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:40:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [11:41:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:42:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:42:22] (03CR) 10Jelto: [V: 03+2 C: 03+2] add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:42:55] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:43:06] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) 
[11:45:47] (Traffic bill over quota) firing: (5) Traffic bill over quota - https://alerts.wikimedia.org [11:48:59] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:50:34] ^ this already works again for me [11:51:08] !log mw2380 - PXE booting - does not boot from hard disk [11:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [11:55:23] RECOVERY - SSH on mw2380 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:00:47] (Traffic bill over quota) firing: (5) Traffic bill over quota - https://alerts.wikimedia.org [12:02:09] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:11] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [12:03:43] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:05:47] (Traffic bill over quota) resolved: (4) Traffic bill over quota - https://alerts.wikimedia.org [12:06:24] !log mw2380 /puppetmaster: reimaged, revoking old cert, signing new cert, initial puppet run T285603 [12:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:33] T285603: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 [12:09:44] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:12:14] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10BTullis) Added myself to the root alias on puppetmaster1001 Added my GPG key to the pwstore repo. 
[12:13:46] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:24:01] !log added btullis to pwstore [12:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] (03PS1) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 [12:39:52] (03PS1) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [12:40:31] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) Thanks for the ping - this needs some thought from the DB side. We have some of our misc db masters on row A - db1159 m1 A6. Affected services: bacula (... [12:43:08] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) @Bstorm @nskaggs please see above - we might need to depool the affected clouddb* hosts. [12:44:07] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) dbproxy1013 is the active proxy for m2. I will depool it next week. 
[12:44:09] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [12:44:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (owner: 10David Caro) [12:46:50] (03CR) 10Effie Mouzeli: "This could be a temp solution and find something better after alex is back" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [12:47:45] RECOVERY - PHP7 jobrunner on mw2380 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:47:47] RECOVERY - MD RAID on mw2380 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:47:49] RECOVERY - PHP7 rendering on mw2380 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [12:48:26] (03PS1) 10Ladsgroup: dumps: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/702933 (https://phabricator.wikimedia.org/T273673) [12:50:11] PROBLEM - memcached socket on mw2380 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [12:52:10] Krinkle: was the removal of the JS error dashboard in https://wikitech.wikimedia.org/w/index.php?title=Backport_windows/Deployers&diff=next&oldid=1916833 intentional? It seemed useful to me. 
[12:56:24] (03PS2) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) [12:56:26] (03PS2) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [12:57:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [13:00:37] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) Hi @Marostegui thanks for the feedback. > Will this stop traffic on all switches at the same time? Or do you plan to d... [13:02:19] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Dzahn) I reimaged mw2380 and it is booting again now. [13:02:45] PROBLEM - mediawiki-installation DSH group on mw2380 is CRITICAL: Host mw2380 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:03:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10dcaro) Installed and put in active on netbox the servers 16, 17, 19 and 20. Waiting for 18 to be fixed :+1: [13:05:13] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10fgiunchedi) With my observability and swift maintainer hats on, I think we're ok to tolerate a network blip, specifically: * ms-be... 
[13:08:19] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2380.codfw.wmnet [13:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:43] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [13:08:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) >>! In T286032#7193704, @cmooney wrote: > Hi @Marostegui thanks for the feedback. > >> Will this stop traffic on al... [13:09:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet [13:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] RECOVERY - mediawiki-installation DSH group on mw2380 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:11:47] !log mw2380 - rebooting [13:11:50] 10SRE, 10Performance-Team, 10Thumbor, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10AntiCompositeNumber) [13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:39] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10ayounsi) The proper fix is T263277. However there are 2 options to get data quickly and temporarily: The easiest and "cleanest"... 
[13:14:59] (03CR) 10Elukey: [C: 04-1] "pcc also fails on releases nodes, due to duplicate declaration:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:15:23] RECOVERY - memcached socket on mw2380 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [13:15:25] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [13:15:28] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) @cmooney do you know when you'll know how long this change can take? [13:15:33] PROBLEM - Memcached on mw1419 is CRITICAL: connect to address 10.64.0.94 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:34] PROBLEM - Memcached on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:37] PROBLEM - Memcached on mw1416 is CRITICAL: connect to address 10.64.0.11 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:47] PROBLEM - Memcached on mw1420 is CRITICAL: connect to address 10.64.0.155 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:51] PROBLEM - Memcached on mw1417 is CRITICAL: connect to address 10.64.0.91 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:16:27] PROBLEM - Memcached on mw1418 is CRITICAL: connect to address 10.64.0.92 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:16:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [13:19:13] PROBLEM - mediawiki-installation DSH group on 
mw1420 is CRITICAL: Host mw1420 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:19:47] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10BBlack) Traffic-related bits: * dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clus... [13:20:42] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10ayounsi) @Marostegui "as the standby host is on row A too" that sounds like SPOF to me and should be moved to a different row. Due... [13:20:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2380.codfw.wmnet [13:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] PROBLEM - mediawiki-installation DSH group on mw1415 is CRITICAL: Host mw1415 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:21:38] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) >>! In T286032#7193787, @ayounsi wrote: > @Marostegui "as the standby host is on row A too" that sounds like SPOF to me... 
[13:21:43] RECOVERY - Memcached on mw1414 is OK: TCP OK - 0.000 second response time on 10.64.0.160 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [13:21:53] PROBLEM - Memcached on mw1415 is CRITICAL: connect to address 10.64.0.9 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:22:04] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:34] tgr: it was. I don't think it should be a mandatory step, the page is too long, people weren't doing that anyway. 
It's still promoted among many on logstash home for quick access [13:22:56] (03CR) 10Hnowlan: [C: 03+1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [13:25:15] (03CR) 10Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:26:26] (03PS1) 10Jgreen: remove payments100[1-4] from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/702939 (https://phabricator.wikimedia.org/T286044) [13:27:59] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) @Marostegui Our only real option to test is on new switches due to be installed under T277340. We are working with DC-Ops... [13:28:04] (03CR) 10Jgreen: [C: 03+2] remove payments100[1-4] from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/702939 (https://phabricator.wikimedia.org/T286044) (owner: 10Jgreen) [13:29:11] RECOVERY - Memcached on mw1415 is OK: TCP OK - 0.003 second response time on 10.64.0.9 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [13:29:27] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) Ok, I think what we can do from our side is to get the replacement hosts ready but without failing over things to them,
[13:29:51] RECOVERY - Memcached on mw1416 is OK: TCP OK - 0.000 second response time on 10.64.0.11 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:29:57] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jcrespo) Speaking on behalf of: ` dbprov1001 ms-backup1001 db1116 ` That could cause ongoing backup runs to fail, but that is "norm...
[13:30:07] RECOVERY - Memcached on mw1417 is OK: TCP OK - 0.000 second response time on 10.64.0.91 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:30:13] RECOVERY - Memcached on mw1420 is OK: TCP OK - 7.073 second response time on 10.64.0.155 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:30:51] RECOVERY - Memcached on mw1418 is OK: TCP OK - 0.000 second response time on 10.64.0.92 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:31:37] RECOVERY - Memcached on mw1421 is OK: TCP OK - 0.000 second response time on 10.64.0.158 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:31:37] RECOVERY - Memcached on mw1419 is OK: TCP OK - 0.000 second response time on 10.64.0.94 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:32:07] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=registry200[5-8].codfw.wmnet,dc=codfw,cluster=docker-registry
[13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:42] (PS4) Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes [puppet] - https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[13:34:44] (PS2) Giuseppe Lavagetto: kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912
[13:35:35] (CR) jerkins-bot: [V: -1] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[13:41:42] SRE, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (Jgiannelos) Chromium-render is now completely moved on native prometheus metrics using service runner. There were many incompatibilities and many broken...
[13:43:09] (PS1) JMeybohm: Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046)
[13:46:03] (PS2) JMeybohm: Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046)
[13:49:24] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[13:49:50] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30105/console" [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[13:51:19] (PS3) Effie Mouzeli: network::constants: add kubepods network constant [puppet] - https://gerrit.wikimedia.org/r/702910
[13:54:07] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry[2005-2008].codfw.wmnet
[13:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:14] (CR) Effie Mouzeli: [C: +2] network::constants: add kubepods network constant [puppet] - https://gerrit.wikimedia.org/r/702910 (owner: Effie Mouzeli)
[13:56:04] (CR) Effie Mouzeli: [C: +2] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[13:57:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:47] (PS2) Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921
[13:59:14] (CR) jerkins-bot: [V: -1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[13:59:34] (PS3) Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921
[14:00:13] (PS4) Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582)
[14:01:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=docker-registry site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:01:45] SRE, ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (jijiki) Open→Resolved Thank you!
[14:02:11] (CR) Elukey: [C: +2] kubernetes: centralize the creation of /etc/kubernetes [puppet] - https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[14:02:45] (CR) Elukey: [V: +2 C: +2] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[14:04:20] (CR) Effie Mouzeli: [C: +2] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[14:12:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry[2005-2008].codfw.wmnet
[14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:59] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw141[4-9].eqiad.wmnet
[14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:40] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw142[0-1].eqiad.wmnet
[14:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:11] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw141[4-9].eqiad.wmnet
[14:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:43] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw142[0-1].eqiad.wmnet
[14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:37] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw141[4-9].eqiad.wmnet
[14:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:59] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw142[0-1].eqiad.wmnet
[14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:27] (CR) JMeybohm: [C: +2] Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046) (owner: JMeybohm)
[14:31:45] (PS5) JMeybohm: dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054)
[14:31:51] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) @jijiki it would not justify such a huge performance shift, by any measure. I am even veering towards disabling onhost memcached, for the latest discoveries of bad interactions with...
[14:38:06] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbstore1004.eqiad.wmnet
[14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:47] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw141[4-9].eqiad.wmnet
[14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:08] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw142[0-1].eqiad.wmnet
[14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:44] SRE, vm-requests: eqiad: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T286057 (JMeybohm)
[14:48:55] SRE, vm-requests: eqiad: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T286057 (JMeybohm) a:JMeybohm
[14:52:25] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbstore1004.eqiad.wmnet
[14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:55] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbstore1004.eqiad.wmnet
[14:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:41] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm for new host dragonfly-supernode1001.eqiad.wmnet
[14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:19] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) Agreed option 1 seems easier and safer than option 2, the sampling isn't great but not the end of the world if we're...
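(Editor's note: the jelto@cumin1001 conftool entries above record a standard depool → reweight → repool cycle for mw1414-mw1421. A minimal dry-run sketch of that cycle is below; the `confctl select … set/…` syntax is assumed from the selectors in the log, and the script only prints the commands it would run rather than performing any real conftool action.)

```shell
#!/bin/sh
# Dry-run sketch (assumed confctl syntax) of the sequence logged above:
# depool each host, set its weight to 30, then repool it.
# emit_conftool_sequence only prints the commands; on a real cumin host
# you would run confctl directly instead of echoing.
emit_conftool_sequence() {
  for host in mw1414 mw1415 mw1416 mw1417 mw1418 mw1419 mw1420 mw1421; do
    echo "confctl select name=${host}.eqiad.wmnet set/pooled=inactive"
    echo "confctl select name=${host}.eqiad.wmnet set/weight=30"
    echo "confctl select name=${host}.eqiad.wmnet set/pooled=yes"
  done
}
emit_conftool_sequence
```

The log performs each step for the whole selector range (`name=mw141[4-9]`, `name=mw142[0-1]`) at once; the per-host loop here is just the expanded equivalent.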
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1415 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1414 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1420 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:58:25] (CR) Giuseppe Lavagetto: [C: +1] dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[14:58:28] jelto: ^ very good. nice work, cy later
[15:01:07] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[15:01:22] SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001) @MoritzMuehlenhoff Is there any updates on substituting in noto-cjk? The files are still not loading the correct preview characters. Sample of Unicode characters missing in svg: `鿬鿫...
[15:02:59] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbstore1004.eqiad.wmnet
[15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:23] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dragonfly-supernode1001.eqiad.wmnet
[15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:17] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm for new host dragonfly-supernode1001.eqiad.wmnet
[15:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:00] (CR) JMeybohm: [C: +2] dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054)
[15:15:58] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:17:00] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:17:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dragonfly-supernode1001.eqiad.wmnet
[15:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:06] (PS1) JMeybohm: dragonfly: Remove fetching of $docker_registry_fqdn cert [puppet] - https://gerrit.wikimedia.org/r/702979 (https://phabricator.wikimedia.org/T286054)
[15:25:18] (PS1) JMeybohm: site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057)
[15:26:27] (PS1) Elukey: profile::kubernetes::node: add hiera config to expose puppet certs [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927)
[15:28:30] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30108/console" [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:28:38] (CR) Elukey: "The alternative to this could be a profile with a separate hiera config, that deploys certs if needed." [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:29:09] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] (PS1) Hnowlan: maps: standardised the maps2.0 config in eqiad, remove old nodes [puppet] - https://gerrit.wikimedia.org/r/702984 (https://phabricator.wikimedia.org/T269582)
[15:35:11] (CR) Elukey: "Ah no of course this is another use case of different users owning files, will need to dig a bit more" [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:37:23] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: Ema)
[15:40:15] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[15:40:36] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) I created a first basic dashboard for the mwdebug deployment and I noticed what the major issue was immediately: I dedicated just 2k maximum opcache scripts, which bottomed out even...
[15:41:37] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[15:44:05] (PS2) JMeybohm: site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057)
[15:48:34] (PS5) Hnowlan: maps: fix osm sync directory path [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[15:49:56] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30110/console" [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[15:53:05] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (dancy) >>! In T284772#7193169, @MoritzMuehlenhoff wrote: > Ok, thanks we can bump vcpus to 8, but per the host metrics doubling RAM wouldn't seem to make any measurable difference, right? I suspect that more...
[15:54:13] !log kormat@cumin1001 START - Cookbook sre.dns.netbox
[15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:15] RECOVERY - NFS Share Volume Space /srv/scratch on cloudstore1008 is OK: DISK OK - free space: /srv/scratch 1070613 MB (27% inode=99%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[15:59:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:45] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[16:03:16] (CR) Hnowlan: [V: +1 C: +1] "I think this is ready to go once we're back from holiday." [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[16:09:22] (CR) Legoktm: [C: +1] varnish: use 403 instead of 429 where appropriate [puppet] - https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: Ema)
[16:11:00] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Andrew Just a heads up that cloudcontrol1003 is in the list. It might be fine and will catch up, but it also could crash r...
[16:11:10] (PS1) Kormat: dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622)
[16:11:52] (PS2) Kormat: dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622)
[16:13:07] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm)
[16:13:20] (CR) Kormat: [C: +2] dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622) (owner: Kormat)
[16:13:49] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Ottomata One of the cloudbs is clouddb1021. FYI. I understand you likely won't be using it that late in the month, but I w...
[16:14:21] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) lists1001 is a SPOF currently, we'll probably just announce a downtime when we get closer to the actual time
[16:28:06] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:29:43] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:31:11] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/30111/" [puppet] - https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[16:35:48] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Ladsgroup) I can draft an announcement for downtime of lists.wikimedia.org, maybe we can use the time to increase its capacity (mor...
[16:36:23] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:37:06] SRE, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (Legoktm)
[16:44:16] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[16:52:44] SRE, Machine-Learning-Team, serviceops, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Next steps: * refactor how `base::expose_puppet_certs` is used in kubernetes profiles, since if profile::...
[16:54:27] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (nskaggs) Impacted clouddb's will be clouddb1013, clouddb1014, clouddb1021. I believe interrupting traffic on 2 of 4 of the "web" r...
[16:55:26] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:56:31] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:59:26] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) @dcaro and @Andrew on the ceph and cloudvirts, I have concerns. We've seen that a lack of network to enough OSDs for a while will cause problems, and the cluster can...
[17:02:19] SRE, Wikimedia-Mailing-lists: Set up spare lists host in codfw, document failover procedure - https://phabricator.wikimedia.org/T286071 (Legoktm)
[17:17:10] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) I'll make a meeting for our team to discuss. There is a ticket for row B as well :)
[17:19:08] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (Legoktm) thanks @AntiCompositeNumber for fixing the image and the additional detail. I agree with declining t...
[17:21:11] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (Legoktm) Would it be useful if I dug through the logs for other such files that don't thumb correctly?
[17:23:07] SRE, Infrastructure-Foundations, Mail, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (faidon) >>! In T232343#7058654, @herron wrote: > **Lists:** Lists/mailman has an internet facing exim inst...
[17:27:52] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (wkandek) Is this the dashboard? https://grafana.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=70&orgId=1&from=1625227688488&to=1625246654342
[17:30:54] (CR) Brennen Bearnes: [V: +2 C: +2] "> tail -n +3 means, output last lines starting with line number 3 (so skip the first two)." [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:31:08] (PS3) Brennen Bearnes: fix cleanup of config backups, make script more robust [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:31:32] (CR) Brennen Bearnes: [V: +2 C: +2] fix cleanup of config backups, make script more robust [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:35:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:57:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:28] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ayounsi)
[18:02:52] SRE, DNS, Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (HMarcus) Open→Resolved a:HMarcus Thanks, this can be closed.
[18:04:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:24] SRE, Infrastructure-Foundations, Mail, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (Legoktm) >>! In T232343#7194626, @faidon wrote: > While some of them could be mitigated (e.g. separate exi...
[18:08:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:59] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:22:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:05] SRE, Traffic, serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm) It would also be nice if the cookbook could check all services, and then fail if at least one didn't verify...
[18:52:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:23] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (Cmjohnson)
[19:35:58] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (Cmjohnson) All are finished with on-site tasks, the raid configuration was also completed.
[19:49:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:25] (PS1) Bstorm: jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004
[20:20:33] (PS1) Bstorm: jessie deprecation: don't build jessie containers when rebuilding [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703005
[20:28:25] (CR) Bstorm: "Note: this is how I built the container that we are using." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[21:28:13] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/699471 (https://phabricator.wikimedia.org/T267683) (owner: Bstorm)
[21:33:01] SRE, Traffic, GitLab (Initialization), Patch-For-Review, User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (brennen)
[21:33:15] SRE, GitLab (Initialization), Patch-For-Review, Release-Engineering-Team (Doing), User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (brennen) Open→Resolved Port is open and in use. > Sure you want to open ssh to the public before backups and l...
[22:06:15] !log removing three files for legal compliance
[22:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:30] (CR) BryanDavis: [C: +1] jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[22:20:11] (CR) Bstorm: [C: +2] jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[22:20:42] (Merged) jenkins-bot: jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[23:48:04] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry)
[23:48:57] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry) Resolved→Open Reopening to follow up on the failure to fully serve all the static files. To followup...