[00:02:15] (CR) Legoktm: [C: -1] cirrus: systemd timer for readahead script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[00:07:52] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:14] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:27:06] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:26] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[00:32:47] (Traffic bill over quota) firing: (4) Traffic bill over quota - https://alerts.wikimedia.org
[00:33:50] (PS1) H.krishna123: [WIP] api_db: Add code to enable database connection Added code to connect to an SQL database, added skeleton for unit tests, cleaned up main.py file and added a singleton class to keep database configuration the same throughout the program. Added DB query functionality for the readiness probe [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[00:34:41] (PS2) H.krishna123: [WIP] api_db: Add code to enable database connection [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[00:34:46] (PS12) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053)
[00:36:07] (CR) Ryan Kemper: cirrus: systemd timer for readahead script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[00:37:47] (Traffic bill over quota) firing: (7) Traffic bill over quota - https://alerts.wikimedia.org
[00:52:47] (Traffic bill over quota) firing: (7) Traffic bill over quota - https://alerts.wikimedia.org
[00:57:47] (Traffic bill over quota) resolved: (3) Traffic bill over quota - https://alerts.wikimedia.org
[01:16:29] !log uploaded elasticsearch-madvise 0.1 to apt.wm.o (T264053)
[01:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:39] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053
[01:26:59] (PS13) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053)
[01:28:35] (CR) Ryan Kemper: cirrus: systemd timer for readahead script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:36:33] (CR) Ryan Kemper: [C: +2] cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:39:07] (PS1) Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702785
[01:41:16] (CR) Ryan Kemper: [C: +2] cirrus: systemd timer for readahead script [puppet] - https://gerrit.wikimedia.org/r/702785 (owner: Ryan Kemper)
[01:46:44] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:48] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:02] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:04] PROBLEM - Check systemd state on elastic2048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:12] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:14] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:14] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:18] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
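[Editor's note: the change merged at 01:36 above deploys a systemd timer that runs a readahead script on the Elasticsearch hosts, and the unit failing in the alerts that follow is elasticsearch-disable-readahead.service. For context, a oneshot service plus timer pair of that general shape could look like the sketch below; the unit names come from the alerts, but the ExecStart path and the schedule are assumptions, not taken from the actual puppet change. Note that any failed unit, including a oneshot like this, is enough to flip `systemctl is-system-running` to "degraded", which is what the check_systemd_state alerts report.]

```ini
# elasticsearch-disable-readahead.service -- hypothetical sketch, not the deployed unit
[Unit]
Description=Disable readahead for Elasticsearch data disks

[Service]
Type=oneshot
# Path is an assumption; per the 01:16 !log entry the real script ships
# in the elasticsearch-madvise package uploaded to apt.wm.o
ExecStart=/usr/bin/elasticsearch-madvise
```

```ini
# elasticsearch-disable-readahead.timer -- hypothetical sketch
[Unit]
Description=Periodically re-disable readahead for Elasticsearch

[Timer]
# Schedule is an assumption
OnCalendar=hourly

[Install]
WantedBy=timers.target
```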
[01:48:22] PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:26] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:28] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:28] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:30] PROBLEM - Check systemd state on elastic2039 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:32] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:36] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:38] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:38] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:40] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:40] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:42] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:42] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:46] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:48] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:48] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:50] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:52] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:54] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:58] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:04] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:08] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:08] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:14] PROBLEM - Check systemd state on elastic2029 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:14] PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:16] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:16] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:22] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:24] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:26] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:30] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:30] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:34] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04713 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[01:49:38] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:40] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:46] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:48] PROBLEM - Check systemd state on elastic2042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2028 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:50] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:54] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:56] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:00] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:02] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:04] PROBLEM - Check systemd state on elastic1038 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:14] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:21] (PS1) Ryan Kemper: cirrus: fix broken module path [puppet] - https://gerrit.wikimedia.org/r/702788 (https://phabricator.wikimedia.org/T264053)
[01:50:22] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:26] PROBLEM - Check systemd state on elastic1036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:44] ^ Fixing this right now
[01:51:01] Sorry for the wall of text
[01:51:52] (CR) Ryan Kemper: [C: +2] cirrus: fix broken module path [puppet] - https://gerrit.wikimedia.org/r/702788 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[01:59:06] SRE, Commons, MediaWiki-File-management, SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (Aklapp...
[02:02:12] (PS1) Ryan Kemper: Revert 1c99db9965361cdf95f042bb2401e86733a31393 [puppet] - https://gerrit.wikimedia.org/r/702790 (https://phabricator.wikimedia.org/T264053)
[02:02:43] `elasticsearch-madvise` seems to have failed to install. I think it's something simple, but getting my changes reverted first: https://gerrit.wikimedia.org/r/c/operations/puppet/+/702790/
[02:05:23] (CR) Ryan Kemper: [C: +2] Revert 1c99db9965361cdf95f042bb2401e86733a31393 [puppet] - https://gerrit.wikimedia.org/r/702790 (https://phabricator.wikimedia.org/T264053) (owner: Ryan Kemper)
[02:09:02] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:42] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:44] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:46] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:46] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:48] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:54] RECOVERY - Check systemd state on elastic2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on elastic2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:56] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:58] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:00] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:06] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:08] RECOVERY - Check systemd state on elastic1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:12] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:16] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:16] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:20] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:20] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:24] RECOVERY - Check systemd state on elastic2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:26] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:26] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:28] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:30] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:30] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] RECOVERY - Check systemd state on elastic1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:34] RECOVERY - Check systemd state on elastic2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:34] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:38] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:40] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:40] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:42] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:46] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:50] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:50] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:52] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:56] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:58] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:00] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:02] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:02] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:08] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:12] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:16] RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:16] RECOVERY - Check systemd state on elastic2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:18] RECOVERY - Check systemd state on elastic1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:18] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:20] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:24] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:26] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:28] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:30] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:32] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:54] * ryankemper forgot that the host I ran puppet to test the changes on still had the manually installed dpkg thus didn't catch the mistake before running puppet across the fleet...classic [02:13:13] * ryankemper shakes fist at the general concept of state [02:13:14] alright we're back to a good state now [02:14:30] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001149 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:16:30] (03PS1) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) [02:17:37] 10SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (10AntiCompositeNumber) p:05High→03Medium >>! In T285875#7189001, @Legoktm wrote: > T226318#5282215 suggests... 
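The "cirrus: systemd timer for readahead script" puppet change being iterated on above packages a periodic maintenance script as a systemd service/timer pair. A hypothetical sketch of what such a unit pair looks like in general (unit names, script path, and schedule are illustrative assumptions, not taken from the actual patch):

```ini
# elasticsearch-readahead.service (illustrative name)
[Unit]
Description=Tune readahead behaviour for Elasticsearch index files

[Service]
Type=oneshot
ExecStart=/usr/local/bin/elasticsearch-readahead

# elasticsearch-readahead.timer (illustrative name)
[Unit]
Description=Run the Elasticsearch readahead script periodically

[Timer]
# Schedule is an assumption; RandomizedDelaySec spreads runs across a fleet.
OnCalendar=hourly
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```

A timer like this is enabled with `systemctl enable --now elasticsearch-readahead.timer`; the matching service unit stays disabled and is only triggered by the timer.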
[02:20:17] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [02:35:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-mediawiki-private.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:13] !log uploaded elasticsearch-madvise_0.1~deb9u1_amd64.changes to stretch-wikimedia on apt1001 [03:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:27] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo disable-puppet "verify new deb package works - T264053"'` [03:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:34] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [03:05:38] (03CR) 10Ryan Kemper: [C: 03+2] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702791 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [03:06:16] !log T264053 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/702791; will run puppet on single host [03:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:04] !log T264053 `ryankemper@elastic2054:~$ sudo run-puppet-agent --force` [03:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:57] !log T264053 `Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install elasticsearch-madvise' returned 100: Reading package lists...` grr [03:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:55] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo apt update'` fixed the issue [03:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:03] T264053: Unsustainable 
increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [03:14:57] !log T264053 `sudo -E cumin 'P:elasticsearch::cirrus' 'sudo run-puppet-agent --force'` [03:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:31] 10SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (10AntiCompositeNumber) 05Open→03Declined [04:29:32] (03PS1) 10Marostegui: db1129: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/702793 [04:30:52] (03CR) 10Marostegui: [C: 03+2] db1129: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/702793 (owner: 10Marostegui) [05:11:54] 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10Aklapper) a:05jgleeson→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sen... [05:12:16] 10SRE: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Ple... [05:14:32] 10SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (10Aklapper) a:05akosiaris→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign th... 
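The elasticsearch-madvise package uploaded and rolled out above exists to curb readahead-driven disk IO (T264053). As a rough illustration of the underlying idea only (the real tool's mechanism and interface are not shown in the log, and the function name here is made up), a file can be flagged for random access with `posix_fadvise`, which tells the kernel to skip aggressive readahead:

```python
import os

def advise_random(paths):
    """Hint the kernel that these files will be read with a random access
    pattern (POSIX_FADV_RANDOM), which suppresses readahead for them.
    Returns the number of files advised. Linux/Unix only."""
    advised = 0
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            # offset=0, length=0 applies the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        finally:
            os.close(fd)
        advised += 1
    return advised
```

The advice is per open file description and purely a hint, which is why a fleet-wide package (rather than a one-off command) is needed to apply it consistently.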
[05:15:07] 10SRE: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`).... [05:15:17] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10Aklapper) a:05RobH→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee o... [05:17:12] 10SRE, 10DNS, 10Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (10Aklapper) a:05mark→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please a... [05:17:49] 10SRE: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10Aklapper) a:05Ottomata→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign this task to... [05:18:09] 10SRE, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10Aklapper) a:05herron→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (s... [05:18:42] 10SRE, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Aklapper) a:05CDanis→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Plea... 
[05:19:12] 10SRE, 10Scap, 10serviceops, 10Goal, 10User-jijiki: SRE FY2019 Q3:TEC6: First steps towards Canary Deployments - https://phabricator.wikimedia.org/T213156 (10Aklapper) a:05jijiki→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails s... [05:20:49] 10SRE, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10Aklapper) a:05akosiaris→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years... [05:21:18] 10SRE, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Aklapper) a:05EBernhardson→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May... [05:22:15] 10SRE, 10Traffic: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and... [05:22:38] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Aklapper) a:05Vgutierrez→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee... [05:22:46] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-Needs-Improvement: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230 (10Aklapper) a:05herron→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to... 
[05:23:24] 10SRE, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10Aklapper) a:05MoritzMuehlenhoff→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to... [05:24:50] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10Aklapper) a:05CDanis→03None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please assign this t... [05:47:26] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) @papaul thank you! [06:08:58] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) It appears that the host gets stuck at {F34535299}, probably something got messed up with the boot order [06:09:12] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) 05Resolved→03Open [06:19:07] (03CR) 10Jcrespo: "See my comments below." (033 comments) [software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: 10H.krishna123) [06:42:26] (03CR) 10Jcrespo: "I strongly suggest you run a python code linter when developing- it will save you many headaches early on, and most likely we will want t" [software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: 10H.krishna123) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210702T0700) [07:10:12] (03PS6) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:10:39] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:12:11] 10SRE, 10Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Automated restarts are in place for most services and everything else is ongoing fine-tuning and add... [07:25:55] !log installing openjdk-8-dbg on wdqs1013 [07:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:26] (03PS7) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:27:54] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:29:10] (03PS8) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:38:29] (03PS9) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [07:44:22] (03CR) 10Jcrespo: "@jbond when you have time (not a priority) please review my wiki edits at https://wikitech.wikimedia.org/w/index.php?title=Puppet%2FWmflib" [puppet] - 10https://gerrit.wikimedia.org/r/694332 
(https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:46:29] thcipriani greg-g brennen: help! I'd like to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702780 -- context is T285996 [07:46:30] T285996: [regression-wmf12] new accounts do not get GrowthExperiments features - https://phabricator.wikimedia.org/T285996 [07:49:30] (03CR) 10Jcrespo: "I think I addressed all comments, this is what the new (single) rule does:" [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:49:59] (are the only people who can approve an emergency deploy based in Americas timezones?) [07:53:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1267.eqiad.wmnet [07:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1268.eqiad.wmnet [07:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:09] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1267.eqiad.wmnet [07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:16] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1268.eqiad.wmnet [07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:59] (03CR) 10JMeybohm: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [07:56:23] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:03:45] !log installing ipmitool security updates [08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
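The conftool actions logged above depool mw1267/mw1268 in two steps: first `pooled=no` (stop sending traffic), then `pooled=inactive` (drop the host from active config ahead of decommissioning). A hedged sketch of that ordering, with the `confctl` argument syntax inferred from the log entries and the command runner made injectable so the sequence can be dry-run:

```python
import subprocess

def depool_host(host, run=subprocess.run):
    """Depool `host` in two steps, mirroring the logged conftool actions:
    pooled=no stops traffic first; pooled=inactive then removes the host
    from the active configuration (e.g. before decommissioning)."""
    for action in ("set/pooled=no", "set/pooled=inactive"):
        # confctl invocation shape is an assumption inferred from the log.
        run(["confctl", "select", f"name={host}", action], check=True)
```

For example, `depool_host("mw1267.eqiad.wmnet")` would reproduce the 07:53-07:54 sequence for that host.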
[08:09:25] (03PS1) 10Dzahn: site/install/conftool: decom mw1267, mw1268 [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) [08:11:32] (03PS1) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) [08:12:02] (03CR) 10Jelto: "please take a look" [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:12:15] PROBLEM - mediawiki-installation DSH group on mw1267 is CRITICAL: Host mw1267 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:14:31] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [08:14:54] (03Abandoned) 10Muehlenhoff: conf: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/702101 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [08:15:55] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1267 is CRITICAL: Host mw1267 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:17:00] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) Status update: the error isn't new (as in, it didn't start appearing on Jun 27th) and thanos-sidecar also sometimes experiences the error. We have sporadic errors dat... [08:18:18] (03CR) 10Jelto: [C: 03+1] "lgtm +1" [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [08:22:50] (03CR) 10Dzahn: [C: 03+1] "looks good to me, but we should add icinga downtimes, then merge.. 
and then they need to be added to conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:23:36] (03PS4) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [08:24:00] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:40] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:41] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:14] (03CR) 10Dzahn: [C: 03+2] site/install/conftool: decom mw1267, mw1268 [puppet] - 10https://gerrit.wikimedia.org/r/702879 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [08:26:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1267.eqiad.wmnet [08:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:31] (03PS2) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 
(https://phabricator.wikimedia.org/T279309) [08:31:46] (03CR) 10Dzahn: [C: 03+1] site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:32:53] (03PS1) 10Kosta Harlan: Fix handling of geEnabled flag [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) [08:36:54] (03PS3) 10Jelto: site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) [08:37:50] (03CR) 10Jelto: [C: 03+2] site: add eight appservers in eqiad row A3 [puppet] - 10https://gerrit.wikimedia.org/r/702880 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [08:45:33] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) Also WRT the upstream bug https://bugs.launchpad.net/swift/+bug/1636663, it could be related but I couldn't get positive confirmation: there are connection timeouts (... 
[08:46:09] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (analytics) [cookbooks] - 10https://gerrit.wikimedia.org/r/702883 [08:46:11] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (search) [cookbooks] - 10https://gerrit.wikimedia.org/r/702884 [08:46:13] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (various) [cookbooks] - 10https://gerrit.wikimedia.org/r/702885 [08:46:15] (03PS1) 10Volans: Use IcingaHosts instead of Icinga (generic) [cookbooks] - 10https://gerrit.wikimedia.org/r/702886 [08:49:37] PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:50:46] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [08:51:33] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:54:56] !log installing golang-docker-credential-helpers security updates [08:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:09] !log deploying emergency backport: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702808 [09:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:37] !log installing node-hosted-git-info security updates [09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:09] (03CR) 10David Caro: [C: 03+2] ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:04:17] (03CR) 10Gergő Tisza: [C: 03+2] "Emergency backport per T285996#7192814" [extensions/GrowthExperiments] 
(wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) (owner: 10Kosta Harlan) [09:04:20] !log decom'ing mw1267 [09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:12] (03PS1) 10Jelto: add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) [09:06:41] (03CR) 10Dzahn: [V: 03+1 C: 03+1] add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [09:07:04] (03CR) 10Jelto: [V: 03+2 C: 03+2] add mcrouter certs for mw1414.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702888 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [09:08:49] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [09:14:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1267.eqiad.wmnet [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:01] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1267.eqiad.wmnet` - m... 
[09:17:38] (03PS5) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [09:19:16] !log restart blazegraph on wdqs1013 [09:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:44] (03PS1) 10David Caro: Revert "ceph.keyring: ensure that the bootstrap dir exists" [puppet] - 10https://gerrit.wikimedia.org/r/702892 [09:21:15] (03CR) 10Elukey: ml_k8s::master: add profile::kubernetes::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [09:21:43] (03CR) 10David Caro: [C: 03+2] "This is breaking current runs on ceph machines." [puppet] - 10https://gerrit.wikimedia.org/r/702892 (owner: 10David Caro) [09:24:00] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0115 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:24:15] !log test thanos 0.21.1 locally on thanos-fe2001 and depool the host - T285835 [09:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:23] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [09:24:36] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.514e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:25:40] (03Merged) 10jenkins-bot: Fix handling of geEnabled flag [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702808 (https://phabricator.wikimedia.org/T285996) (owner: 10Kosta Harlan) [09:28:20] kostajh: it's on mwdebug2001 [09:29:04] tgr: having a look [09:30:07] tgr: I created an account on cswiki, and got welcome survey + homepage [09:32:16] 10SRE, 10User-MoritzMuehlenhoff: Sort out which RAID packages are still needed - 
https://phabricator.wikimedia.org/T216043 (10MoritzMuehlenhoff) [09:32:18] tgr: I also confirmed that geEnabled=1 & campaign redirects straight to homepage, while geEnabled=0 switches features off [09:32:22] so, lgtm [09:32:30] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.002844 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:32:40] 10SRE, 10User-MoritzMuehlenhoff: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10MoritzMuehlenhoff) [09:34:15] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (10MoritzMuehlenhoff) [09:34:48] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/30095/" [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:36:02] yeah, same here. I also got a WS-but-no-homepage with no extra flags so the randomization seems to work. [09:36:09] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Ladsgroup) The main person working on this is Kunal and he was busy with deploying shellbox for Score l... 
[09:36:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1268.eqiad.wmnet [09:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:38] (03PS1) 10Matthias Mullie: Separate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) [09:37:45] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/GrowthExperiments/includes/HomepageHooks.php: Backport: [[gerrit:702808|Fix handling of geEnabled flag (T285996)]] (duration: 00m 57s) [09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] T285996: [regression-wmf12] new accounts do not get GrowthExperiments features - https://phabricator.wikimedia.org/T285996 [09:40:18] help! I too have an emergency deployment request (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/702810 - An UploadWizard step is significantly unusable, and I only now realized there will be no deployments next week...) [09:41:23] matthiasmullie: Happy to deploy if there's an SRE around to confirm it's OK to deploy now. [09:41:57] +1, I was also thinking that might be worth a backport [09:42:29] (03PS1) 10Ema: varnish: use 403 instead of 429 where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) [09:43:17] All emergency contacts this week are on or near the West Coast, sadly. [09:44:38] Ping to thcipriani greg-g brennen dduvall for an emergency deploy request for T285579 for matthiasmullie. 
[09:44:39] T285579: "Add data" step of Upload Wizard broken - https://phabricator.wikimedia.org/T285579 [09:48:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1268.eqiad.wmnet [09:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:45] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1268.eqiad.wmnet` - m... [09:49:35] James_F, Lucas_WMDE - earlier on we had another emergency deployment, if the fix's scope is contained/not-too-broad I'd say that we can proceed. If possible let's do it now so more people are around (not later on in the afternoon) [09:50:36] 10SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (10MoritzMuehlenhoff) >>! In T284772#7187773, @dancy wrote: > Most of the time releases1002 isn't doing much, just waiting for jobs to be triggered. One of the jobs (mediawiki-config-pipeline-wmf-publish) curre... [09:51:34] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) This is done, isn't it? The performance issues are being mitigated by migrating to nginx light I think (someone needs to double check) [09:52:36] Fine, let's do it. [09:52:47] yay! 
[09:52:48] (03CR) 10Jforrester: [C: 03+2] Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [09:52:56] (03PS1) 10David Caro: ceph.keyring: make sure the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) [09:53:11] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005754 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:54:53] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) TLSv1.3 is up & running, performance issues are being mitigated by replacing ats-tls with envoy or haproxy in the short term :) [09:54:55] (03CR) 10David Caro: [C: 04-1] ceph.keyring: make sure the bootstrap dir exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:55:12] (03PS2) 10David Caro: ceph.keyring: make sure the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) [09:55:14] (03CR) 10David Caro: ceph.keyring: make sure the bootstrap dir exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:55:18] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) >>! In T285835#7193102, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/BIyIZnoBa_6PSCT9b6Oy} [20... [09:58:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I don't have the context on the mentioned dependency cycle, but the patch LGTM anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [09:59:27] matthiasmullie: Bah, the selenium run failed. [09:59:41] "Failed at the Wikibase@0.1.0 selenium-test script.", what a surprise. [10:00:24] :) [10:00:42] (03CR) 10David Caro: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/702897 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:01:32] (03CR) 10jerkins-bot: [V: 04-1] Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:02:48] (03CR) 10Jforrester: [C: 03+2] "…" [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:05:00] Amir1: I'm literally deploying it now. :-) [10:05:18] James_F: ugh, I should have checked IRC before [10:05:20] (03PS1) 10Elukey: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) [10:05:23] * James_F grins. [10:05:24] awesome! [10:05:33] Thanks [10:05:45] * James_F grumbles about the new "miscweb" service destroying tab-completion of `/srv/m`. [10:06:10] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:06:32] Clearly we should have called it /srv/nanosites or whatever. [10:08:22] now I have to find out what miscweb is ;) [10:08:42] static-bugzilla, apparently? [10:09:27] wdqs gui [10:09:27] So far just that, yes. [10:09:36] Eventually a few more little things. [10:09:41] transparency reports, etc. 
[10:09:53] https://wikitech.wikimedia.org/wiki/Miscweb1002 [10:09:53] (03PS2) 10Elukey: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) [10:10:05] so a container to host static websites right? [10:10:11] https://wikitech.wikimedia.org/wiki/Microsites [10:10:12] Calling them "microsites" would be more industry-standard, but wouldn't help with tab completion. [10:10:23] most of those aren't in containers yet afaik [10:10:36] it's a ganeti VM [10:10:41] with apache [10:10:55] like people1002 basically [10:11:13] Yeah. [10:11:30] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [10:16:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30099/console" [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:18:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You should use require/include in the two dependent classes and then you can drop the if defined." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:19:36] <_joe_> mutante is working on moving microsites to k8s IIRC [10:24:37] PROBLEM - Memcached on mw1414 is CRITICAL: connect to address 10.64.0.160 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:25:07] (03Merged) 10jenkins-bot: Seperate between and controls [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702810 (https://phabricator.wikimedia.org/T285579) (owner: 10Matthias Mullie) [10:25:12] Aha. Finally. [10:26:09] matthiasmullie: Should be live on mwdebug2001. [10:26:15] Checking... [10:31:56] James_F: seems to work, can proceed! [10:32:24] Going. 
[10:33:11] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/WikibaseMediaInfo: UploadWizard/WikibaseMediaInfo fix 3fd2873 for [[phab:T285579|T285579]] (duration: 00m 59s) [10:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:20] T285579: "Add data" step of Upload Wizard broken - https://phabricator.wikimedia.org/T285579 [10:33:51] James_F: thanks a lot! [10:34:07] Any time. [10:34:13] No spike in errors that I see. [10:34:16] Calling this a success. [10:34:51] nice :) [10:35:40] (03PS1) 10Effie Mouzeli: network::constants: add kubepods network constant [puppet] - 10https://gerrit.wikimedia.org/r/702910 [10:49:56] (03PS3) 10Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [10:49:58] (03PS1) 10Giuseppe Lavagetto: kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702912 [10:50:28] (03CR) 10Muehlenhoff: network::constants: add kubepods network constant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [10:51:42] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - 10https://gerrit.wikimedia.org/r/702912 (owner: 10Giuseppe Lavagetto) [10:51:47] PROBLEM - mediawiki-installation DSH group on mw1414 is CRITICAL: Host mw1414 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:56:10] (03CR) 10Effie Mouzeli: network::constants: add kubepods network constant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [10:56:52] (03PS2) 10Effie Mouzeli: network::constants: add kubepods network constant [puppet] - 10https://gerrit.wikimedia.org/r/702910 [10:59:24] <_joe_> effie: do we really need to add a global def there? 
[11:00:54] <_joe_> can't we just join all the ones we need in the firewall rules definition? [11:01:13] <_joe_> CODFW_PRIVATE_PRIVATE1_KUBEPODS_CODFW and so on [11:01:21] _joe_: my thought was that it'd be needed more than once as we are adding new services [11:01:49] not just for maps hosts I mean [11:01:52] <_joe_> heh I was thinking it's better to explicitly list the clusters you give access to [11:02:05] <_joe_> anyhow, bbl sorry [11:02:19] moritzm: what is your opinion? [11:03:17] (03PS3) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [11:03:19] (03PS2) 10Hnowlan: maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) [11:09:27] effie: no strong opinion, but I think Alex said in the past that we should rather reduce network constants (except the major ones like production_networks) than add new ones [11:10:38] alright, what I generally want is that next time someone needs to add those networks, to do it easily [11:11:13] what would be the alternative?
those subnets are defined in modules/network/data/data.yaml [11:11:31] I think we would like to avoid defining them again [11:11:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [11:14:57] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:14:59] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:05] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [11:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] ACKNOWLEDGEMENT - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T285603 [11:22:00] (03PS1) 10Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 [11:24:18] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/30100/maps1009.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [11:28:34] !log powercycling mw2380, trying to make it boot [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
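The trade-off debated above (adding one shared `kubepods` network constant vs. joining the per-cluster subnet lists directly in each firewall rule, as _joe_ suggests) can be sketched in plain Python. The subnet values below are made up for illustration only; the real ones live in modules/network/data/data.yaml in operations/puppet, and `join_subnets` is a hypothetical helper, not actual puppet code:

```python
import ipaddress

# Hypothetical per-cluster pod subnets (illustrative values only; the real
# data lives in modules/network/data/data.yaml and should not be redefined).
KUBEPODS = {
    "eqiad": ["10.67.128.0/18"],
    "codfw": ["10.194.128.0/18"],
}

def join_subnets(clusters):
    """Flatten the chosen clusters' subnet lists into one sorted list,
    the way a firewall rule's source range could be assembled on the fly
    from existing per-cluster data instead of a new global constant."""
    nets = [ipaddress.ip_network(n) for c in clusters for n in KUBEPODS[c]]
    return sorted(str(n) for n in nets)

print(join_subnets(["eqiad", "codfw"]))
```

The upside of joining at the rule site is that each rule explicitly lists the clusters it grants access to; the upside of a shared constant is that the next service needing the same access gets it in one place.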
[11:29:29] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/30101/" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [11:31:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [11:35:55] (03PS1) 10Jelto: add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) [11:39:22] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:40:14] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:40:18] (03CR) 10Dzahn: [C: 03+1] add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:40:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [11:41:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:42:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:42:22] (03CR) 10Jelto: [V: 03+2 C: 03+2] add mcrouter certs for mw1415.eqiad.wmnet to mw1421.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/702923 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:42:55] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:43:06] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) 
[11:45:47] (Traffic bill over quota) firing: (5) Traffic bill over quota - https://alerts.wikimedia.org [11:48:59] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:50:34] ^ this already works again for me [11:51:08] !log mw2380 - PXE booting - does not boot from hard disk [11:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [11:55:23] RECOVERY - SSH on mw2380 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:00:47] (Traffic bill over quota) firing: (5) Traffic bill over quota - https://alerts.wikimedia.org [12:02:09] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:11] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [12:03:43] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:05:47] (Traffic bill over quota) resolved: (4) Traffic bill over quota - https://alerts.wikimedia.org [12:06:24] !log mw2380 /puppetmaster: reimaged, revoking old cert, signing new cert, initial puppet run T285603 [12:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:33] T285603: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 [12:09:44] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:12:14] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10BTullis) Added myself to the root alias on puppetmaster1001 Added my GPG key to the pwstore repo. 
[12:13:46] 10SRE, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:24:01] !log added btullis to pwstore [12:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] (03PS1) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 [12:39:52] (03PS1) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [12:40:31] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) Thanks for the ping - this needs some thought from the DB side. We have some of our misc db masters on row A - db1159 m1 A6. Affected services: bacula (... [12:43:08] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) @Bstorm @nskaggs please see above - we might need to depool the affected clouddb* hosts. [12:44:07] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) dbproxy1013 is the active proxy for m2. I will depool it next week. 
[12:44:09] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [12:44:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (owner: 10David Caro) [12:46:50] (03CR) 10Effie Mouzeli: "This could be a temp solution and find something better after alex is back" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [12:47:45] RECOVERY - PHP7 jobrunner on mw2380 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:47:47] RECOVERY - MD RAID on mw2380 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:47:49] RECOVERY - PHP7 rendering on mw2380 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [12:48:26] (03PS1) 10Ladsgroup: dumps: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/702933 (https://phabricator.wikimedia.org/T273673) [12:50:11] PROBLEM - memcached socket on mw2380 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [12:52:10] Krinkle: was the removal of the JS error dashboard in https://wikitech.wikimedia.org/w/index.php?title=Backport_windows/Deployers&diff=next&oldid=1916833 intentional? It seemed useful to me. 
[12:56:24] (03PS2) 10David Caro: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) [12:56:26] (03PS2) 10David Caro: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) [12:57:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/702910 (owner: 10Effie Mouzeli) [13:00:37] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) Hi @Marostegui thanks for the feedback. > Will this stop traffic on all switches at the same time? Or do you plan to d... [13:02:19] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Dzahn) I reimaged mw2380 and it is booting again now. [13:02:45] PROBLEM - mediawiki-installation DSH group on mw2380 is CRITICAL: Host mw2380 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:03:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10dcaro) Installed and put in active on netbox the servers 16, 17, 19 and 20. Waiting for 18 to be fixed :+1: [13:05:13] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10fgiunchedi) With my observability and swift maintainer hats on, I think we're ok to tolerate a network blip, specifically: * ms-be... 
[13:08:19] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2380.codfw.wmnet [13:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:43] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [13:08:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) >>! In T286032#7193704, @cmooney wrote: > Hi @Marostegui thanks for the feedback. > >> Will this stop traffic on al... [13:09:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet [13:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] RECOVERY - mediawiki-installation DSH group on mw2380 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:11:47] !log mw2380 - rebooting [13:11:50] 10SRE, 10Performance-Team, 10Thumbor, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10AntiCompositeNumber) [13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:39] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10ayounsi) The proper fix is T263277. However there are 2 options to get data quickly and temporarily: The easiest and "cleanest"... 
[13:14:59] (03CR) 10Elukey: [C: 04-1] "pcc also fails on releases nodes, due to duplicate declaration:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:15:23] RECOVERY - memcached socket on mw2380 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [13:15:25] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [13:15:28] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) @cmooney do you know when you'll know how long this change can take? [13:15:33] PROBLEM - Memcached on mw1419 is CRITICAL: connect to address 10.64.0.94 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:34] PROBLEM - Memcached on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:37] PROBLEM - Memcached on mw1416 is CRITICAL: connect to address 10.64.0.11 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:47] PROBLEM - Memcached on mw1420 is CRITICAL: connect to address 10.64.0.155 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:15:51] PROBLEM - Memcached on mw1417 is CRITICAL: connect to address 10.64.0.91 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:16:27] PROBLEM - Memcached on mw1418 is CRITICAL: connect to address 10.64.0.92 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:16:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [13:19:13] PROBLEM - mediawiki-installation DSH group on 
mw1420 is CRITICAL: Host mw1420 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:19:47] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10BBlack) Traffic-related bits: * dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clus... [13:20:42] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10ayounsi) @Marostegui "as the standby host is on row A too" that sounds like SPOF to me and should be moved to a different row. Due... [13:20:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2380.codfw.wmnet [13:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] PROBLEM - mediawiki-installation DSH group on mw1415 is CRITICAL: Host mw1415 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:21:38] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) >>! In T286032#7193787, @ayounsi wrote: > @Marostegui "as the standby host is on row A too" that sounds like SPOF to me... 
[13:21:43] RECOVERY - Memcached on mw1414 is OK: TCP OK - 0.000 second response time on 10.64.0.160 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [13:21:53] PROBLEM - Memcached on mw1415 is CRITICAL: connect to address 10.64.0.9 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:22:04] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1420-1421].eqiad.wmnet with reason: setup new appservers in eqiad A3 https://phabricator.wikimedia.org/T279309 [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:34] tgr: it was. I don't think it should be a mandatory step, the page is too long, people weren't doing that anyway. 
It's still promoted among many on logstash home for quick access [13:22:56] (03CR) 10Hnowlan: [C: 03+1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - 10https://gerrit.wikimedia.org/r/702921 (owner: 10Effie Mouzeli) [13:25:15] (03CR) 10Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:26:26] (03PS1) 10Jgreen: remove payments100[1-4] from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/702939 (https://phabricator.wikimedia.org/T286044) [13:27:59] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) @Marostegui Our only real option to test is on new switches due to be installed under T277340. We are working with DC-Ops... [13:28:04] (03CR) 10Jgreen: [C: 03+2] remove payments100[1-4] from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/702939 (https://phabricator.wikimedia.org/T286044) (owner: 10Jgreen) [13:29:11] RECOVERY - Memcached on mw1415 is OK: TCP OK - 0.003 second response time on 10.64.0.9 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [13:29:27] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) Ok, I think what we can do from our side is to get the replacement hosts ready but without failing over things to them,
[13:29:51] RECOVERY - Memcached on mw1416 is OK: TCP OK - 0.000 second response time on 10.64.0.11 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:29:57] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jcrespo) Speaking on behalf of: ` dbprov1001 ms-backup1001 db1116 ` That could cause ongoing backup runs to fail, but that is "norm...
[13:30:07] RECOVERY - Memcached on mw1417 is OK: TCP OK - 0.000 second response time on 10.64.0.91 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:30:13] RECOVERY - Memcached on mw1420 is OK: TCP OK - 7.073 second response time on 10.64.0.155 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:30:51] RECOVERY - Memcached on mw1418 is OK: TCP OK - 0.000 second response time on 10.64.0.92 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:31:37] RECOVERY - Memcached on mw1421 is OK: TCP OK - 0.000 second response time on 10.64.0.158 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:31:37] RECOVERY - Memcached on mw1419 is OK: TCP OK - 0.000 second response time on 10.64.0.94 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:32:07] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=registry200[5-8].codfw.wmnet,dc=codfw,cluster=docker-registry
[13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:42] (PS4) Giuseppe Lavagetto: kubernetes: centralize the creation of /etc/kubernetes [puppet] - https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[13:34:44] (PS2) Giuseppe Lavagetto: kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912
[13:35:35] (CR) jerkins-bot: [V: -1] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[13:41:42] SRE, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (Jgiannelos) Chromium-render is now completely moved on native prometheus metrics using service runner. There were many incompatibilities and many broken...
[13:43:09] (PS1) JMeybohm: Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046)
[13:46:03] (PS2) JMeybohm: Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046)
[13:49:24] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[13:49:50] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30105/console" [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[13:51:19] (PS3) Effie Mouzeli: network::constants: add kubepods network constant [puppet] - https://gerrit.wikimedia.org/r/702910
[13:54:07] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry[2005-2008].codfw.wmnet
[13:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:14] (CR) Effie Mouzeli: [C: +2] network::constants: add kubepods network constant [puppet] - https://gerrit.wikimedia.org/r/702910 (owner: Effie Mouzeli)
[13:56:04] (CR) Effie Mouzeli: [C: +2] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[13:57:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:47] (PS2) Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921
[13:59:14] (CR) jerkins-bot: [V: -1] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[13:59:34] (PS3) Effie Mouzeli: profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921
[14:00:13] (PS4) Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582)
[14:01:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=docker-registry site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:01:45] SRE, ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (jijiki) Open→Resolved Thank you!
[14:02:11] (CR) Elukey: [C: +2] kubernetes: centralize the creation of /etc/kubernetes [puppet] - https://gerrit.wikimedia.org/r/702898 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[14:02:45] (CR) Elukey: [V: +2 C: +2] kubernetes: use k8s::base_dir everywhere it's appropriate [puppet] - https://gerrit.wikimedia.org/r/702912 (owner: Giuseppe Lavagetto)
[14:04:20] (CR) Effie Mouzeli: [C: +2] profile::maps::postgresql_common: allow connections from kubepods [puppet] - https://gerrit.wikimedia.org/r/702921 (owner: Effie Mouzeli)
[14:12:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry[2005-2008].codfw.wmnet
[14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:59] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw141[4-9].eqiad.wmnet
[14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:40] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw142[0-1].eqiad.wmnet
[14:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:11] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw141[4-9].eqiad.wmnet
[14:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:43] !log jelto@cumin1001 conftool action : set/weight=30; selector: name=mw142[0-1].eqiad.wmnet
[14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:37] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw141[4-9].eqiad.wmnet
[14:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:59] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw142[0-1].eqiad.wmnet
[14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:27] (CR) JMeybohm: [C: +2] Revert "Add 4 new docker-regisry nodes in codfw" [puppet] - https://gerrit.wikimedia.org/r/702947 (https://phabricator.wikimedia.org/T286046) (owner: JMeybohm)
[14:31:45] (PS5) JMeybohm: dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054)
[14:31:51] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) @jijiki it would not justify such a huge performance shift, by any measure. I am even veering towards disabling onhost memcached, for the latest discoveries of bad interactions with...
[14:38:06] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbstore1004.eqiad.wmnet
[14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:47] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw141[4-9].eqiad.wmnet
[14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:08] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw142[0-1].eqiad.wmnet
[14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:44] SRE, vm-requests: eqiad: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T286057 (JMeybohm)
[14:48:55] SRE, vm-requests: eqiad: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T286057 (JMeybohm) a:JMeybohm
[14:52:25] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbstore1004.eqiad.wmnet
[14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:55] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbstore1004.eqiad.wmnet
[14:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:41] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm for new host dragonfly-supernode1001.eqiad.wmnet
[14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:19] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) Agreed option 1 seems easier and safer than option 2, the sampling isn't great but not the end of the world if we're...
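(Editor's note: the jelto@cumin1001 conftool entries above record a standard depool → reweight → repool cycle for mw1414-mw1421. A minimal dry-run sketch of that cycle is below; the `confctl select … set/…` syntax is assumed from the selectors in the log, and the script only prints the commands it would run rather than performing any real conftool action.)

```shell
#!/bin/sh
# Dry-run sketch (assumed confctl syntax) of the sequence logged above:
# depool each host, set its weight to 30, then repool it.
# emit_conftool_sequence only prints the commands; on a real cumin host
# you would run confctl directly instead of echoing.
emit_conftool_sequence() {
  for host in mw1414 mw1415 mw1416 mw1417 mw1418 mw1419 mw1420 mw1421; do
    echo "confctl select name=${host}.eqiad.wmnet set/pooled=inactive"
    echo "confctl select name=${host}.eqiad.wmnet set/weight=30"
    echo "confctl select name=${host}.eqiad.wmnet set/pooled=yes"
  done
}
emit_conftool_sequence
```

The log performs each step for the whole selector range (`name=mw141[4-9]`, `name=mw142[0-1]`) at once; the per-host loop here is just the expanded equivalent.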
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1415 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1414 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:56:55] RECOVERY - mediawiki-installation DSH group on mw1420 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:58:25] (CR) Giuseppe Lavagetto: [C: +1] dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[14:58:28] jelto: ^ very good. nice work, cy later
[15:01:07] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[15:01:22] SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001) @MoritzMuehlenhoff Is there any updates on substituting in noto-cjk? The files are still not loading the correct preview characters. Sample of Unicode characters missing in svg: `鿬鿫...
[15:02:59] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbstore1004.eqiad.wmnet
[15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:23] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dragonfly-supernode1001.eqiad.wmnet
[15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:17] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm for new host dragonfly-supernode1001.eqiad.wmnet
[15:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:00] (CR) JMeybohm: [C: +2] dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T286054)
[15:15:58] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:17:00] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:17:06] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dragonfly-supernode1001.eqiad.wmnet
[15:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:06] (PS1) JMeybohm: dragonfly: Remove fetching of $docker_registry_fqdn cert [puppet] - https://gerrit.wikimedia.org/r/702979 (https://phabricator.wikimedia.org/T286054)
[15:25:18] (PS1) JMeybohm: site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057)
[15:26:27] (PS1) Elukey: profile::kubernetes::node: add hiera config to expose puppet certs [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927)
[15:28:30] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30108/console" [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:28:38] (CR) Elukey: "The alternative to this could be a profile with a separate hiera config, that deploys certs if needed." [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:29:09] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] (PS1) Hnowlan: maps: standardised the maps2.0 config in eqiad, remove old nodes [puppet] - https://gerrit.wikimedia.org/r/702984 (https://phabricator.wikimedia.org/T269582)
[15:35:11] (CR) Elukey: "Ah no of course this is another use case of different users owning files, will need to dig a bit more" [puppet] - https://gerrit.wikimedia.org/r/702983 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[15:37:23] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: Ema)
[15:40:15] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[15:40:36] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) I created a first basic dashboard for the mwdebug deployment and I noticed what the major issue was immediately: I dedicated just 2k maximum opcache scripts, which bottomed out even...
[15:41:37] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[15:44:05] (PS2) JMeybohm: site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp [puppet] - https://gerrit.wikimedia.org/r/702982 (https://phabricator.wikimedia.org/T286057)
[15:48:34] (PS5) Hnowlan: maps: fix osm sync directory path [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[15:49:56] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30110/console" [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[15:53:05] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (dancy) >>! In T284772#7193169, @MoritzMuehlenhoff wrote: > Ok, thanks we can bump vcpus to 8, but per the host metrics doubling RAM wouldn't seem to make any measurable difference, right? I suspect that more...
[15:54:13] !log kormat@cumin1001 START - Cookbook sre.dns.netbox
[15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:15] RECOVERY - NFS Share Volume Space /srv/scratch on cloudstore1008 is OK: DISK OK - free space: /srv/scratch 1070613 MB (27% inode=99%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[15:59:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:45] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[16:03:16] (CR) Hnowlan: [V: +1 C: +1] "I think this is ready to go once we're back from holiday." [puppet] - https://gerrit.wikimedia.org/r/701558 (owner: MSantos)
[16:09:22] (CR) Legoktm: [C: +1] varnish: use 403 instead of 429 where appropriate [puppet] - https://gerrit.wikimedia.org/r/702896 (https://phabricator.wikimedia.org/T224891) (owner: Ema)
[16:11:00] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Andrew Just a heads up that cloudcontrol1003 is in the list. It might be fine and will catch up, but it also could crash r...
[16:11:10] (PS1) Kormat: dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622)
[16:11:52] (PS2) Kormat: dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622)
[16:13:07] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm)
[16:13:20] (CR) Kormat: [C: +2] dbstore1004: Rename to db1183, prep for m5. [puppet] - https://gerrit.wikimedia.org/r/702988 (https://phabricator.wikimedia.org/T284622) (owner: Kormat)
[16:13:49] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Ottomata One of the cloudbs is clouddb1021. FYI. I understand you likely won't be using it that late in the month, but I w...
[16:14:21] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) lists1001 is a SPOF currently, we'll probably just announce a downtime when we get closer to the actual time
[16:28:06] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:29:43] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:31:11] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/30111/" [puppet] - https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[16:35:48] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Ladsgroup) I can draft an announcement for downtime of lists.wikimedia.org, maybe we can use the time to increase its capacity (mor...
[16:36:23] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:37:06] SRE, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (Legoktm)
[16:44:16] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[16:52:44] SRE, Machine-Learning-Team, serviceops, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Next steps: * refactor how `base::expose_puppet_certs` is used in kubernetes profiles, since if profile::...
[16:54:27] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (nskaggs) Impacted clouddb's will be clouddb1013, clouddb1014, clouddb1021. I believe interrupting traffic on 2 of 4 of the "web" r...
[16:55:26] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:56:31] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:59:26] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) @dcaro and @Andrew on the ceph and cloudvirts, I have concerns. We've seen that a lack of network to enough OSDs for a while will cause problems, and the cluster can...
[17:02:19] SRE, Wikimedia-Mailing-lists: Set up spare lists host in codfw, document failover procedure - https://phabricator.wikimedia.org/T286071 (Legoktm)
[17:17:10] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) I'll make a meeting for our team to discuss. There is a ticket for row B as well :)
[17:19:08] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (Legoktm) thanks @AntiCompositeNumber for fixing the image and the additional detail. I agree with declining t...
[17:21:11] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (Legoktm) Would it be useful if I dug through the logs for other such files that don't thumb correctly?
[17:23:07] SRE, Infrastructure-Foundations, Mail, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (faidon) >>! In T232343#7058654, @herron wrote: > **Lists:** Lists/mailman has an internet facing exim inst...
[17:27:52] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (wkandek) Is this the dashboard? https://grafana.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=70&orgId=1&from=1625227688488&to=1625246654342
[17:30:54] (CR) Brennen Bearnes: [V: +2 C: +2] "> tail -n +3 means, output last lines starting with line number 3 (so skip the first two)." [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:31:08] (PS3) Brennen Bearnes: fix cleanup of config backups, make script more robust [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:31:32] (CR) Brennen Bearnes: [V: +2 C: +2] fix cleanup of config backups, make script more robust [gitlab-ansible] - https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:35:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:57:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:28] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ayounsi)
[18:02:52] SRE, DNS, Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (HMarcus) Open→Resolved a:HMarcus Thanks, this can be closed.
[18:04:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:24] SRE, Infrastructure-Foundations, Mail, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (Legoktm) >>! In T232343#7194626, @faidon wrote: > While some of them could be mitigated (e.g. separate exi...
[18:08:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:59] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:22:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:05] SRE, Traffic, serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm) It would also be nice if the cookbook could check all services, and then fail if at least one didn't verify...
[18:52:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:23] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (Cmjohnson)
[19:35:58] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (Cmjohnson) All are finished with on-site tasks, the raid configuration was also completed.
[19:49:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:25] (PS1) Bstorm: jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004
[20:20:33] (PS1) Bstorm: jessie deprecation: don't build jessie containers when rebuilding [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703005
[20:28:25] (CR) Bstorm: "Note: this is how I built the container that we are using." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[21:28:13] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/699471 (https://phabricator.wikimedia.org/T267683) (owner: Bstorm)
[21:33:01] SRE, Traffic, GitLab (Initialization), Patch-For-Review, User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (brennen)
[21:33:15] SRE, GitLab (Initialization), Patch-For-Review, Release-Engineering-Team (Doing), User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (brennen) Open→Resolved Port is open and in use. > Sure you want to open ssh to the public before backups and l...
[22:06:15] !log removing three files for legal compliance
[22:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:30] (CR) BryanDavis: [C: +1] jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[22:20:11] (CR) Bstorm: [C: +2] jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[22:20:42] (Merged) jenkins-bot: jdk8 sucks: Work around strange issue with debian alternatives system [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/703004 (owner: Bstorm)
[23:48:04] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry)
[23:48:57] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry) Resolved→Open Reopening to follow up on the failure to fully serve all the static files. To followup...