[00:00:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:01:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:05] !log egardner@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/MediaSearch: Backport: [[gerrit:710387|Revert "Open search result links in-place"]] (duration: 00m 58s)
[00:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:09:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2064.codfw.wmnet with reason: REIMAGE
[00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2064.codfw.wmnet with reason: REIMAGE
[00:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:51] PROBLEM - Host ms-be2064 is DOWN: PING CRITICAL - Packet loss = 100%
[00:25:09] RECOVERY - Host ms-be2064 is UP: PING OK - Packet loss = 0%, RTA = 31.12 ms
[00:27:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:29:35] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:24] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2064.codfw.wmnet'] ` and were **ALL** successful.
[00:41:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:43:20] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2065.codfw.wmnet ` The log can be found in `/var/l...
[00:43:41] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2065.codfw.wmnet with reason: REIMAGE
[00:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2065.codfw.wmnet with reason: REIMAGE
[01:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:11] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_httpbb.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:06:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:12:31] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:14:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:23:59] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2065.codfw.wmnet'] ` and were **ALL** successful.
[01:25:17] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (Papaul)
[01:25:57] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (Papaul) Open→Resolved @fgiunchedi this is complete
[01:28:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:34:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:35:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:35:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:39:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:40:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:41:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:46:31] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:56:29] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:21] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:25] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[02:10:15] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.02482 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[02:12:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:21:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:25:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:57:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:59:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:02:19] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:29] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[03:10:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:14:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:23:05] (PS1) Tim Starling: Update bv2017/voterList.php to make a new generic script [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710395
[03:23:14] (CR) Tim Starling: [C: +2] Update bv2017/voterList.php to make a new generic script [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710395 (owner: Tim Starling)
[03:23:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:26:25] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.0212 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[03:27:17] (Merged) jenkins-bot: Update bv2017/voterList.php to make a new generic script [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710395 (owner: Tim Starling)
[03:30:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[03:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:32:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[03:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:52:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:54:06] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/SecurePoll/cli/wm-scripts/makeGlobalVoterList.php: need to run this script T288025 (duration: 00m 57s)
[03:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:13] T288025: Create SecurePoll voter list for 2021 board vote - https://phabricator.wikimedia.org/T288025
[03:55:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:56:25] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[03:56:55] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:45] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:03] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01856 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[04:03:13] !log on mwmaint1002 mwscript extensions/SecurePoll/cli/wm-scripts/makeGlobalVoterList.php --wiki=mediawikiwiki --edit-count-table=bv2021_edits --list-name=board-vote-2021 --short-min-edits=20 --long-min-edits=300
[04:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:07:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:10:46] (CR) Thcipriani: [C: +1] "noc is the first place I looked, so this makes sense to me. Thank you for this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/710128 (owner: Legoktm)
[04:13:08] (Abandoned) Thcipriani: feat: Add mwdebug cname [dns] - https://gerrit.wikimedia.org/r/708874 (owner: Thcipriani)
[04:19:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:20:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:29:15] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:38:12] (PS3) Marostegui: production-m5.sql.erb: Add toolhub grants [puppet] - https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480)
[04:38:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:39:12] (CR) Marostegui: [C: +2] production-m5.sql.erb: Add toolhub grants [puppet] - https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[04:39:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:15] (PS1) Marostegui: production-m5.sql: Add dbproxy20040's IP [puppet] - https://gerrit.wikimedia.org/r/710416 (https://phabricator.wikimedia.org/T271480)
[04:43:55] (CR) Marostegui: [C: +2] production-m5.sql: Add dbproxy20040's IP [puppet] - https://gerrit.wikimedia.org/r/710416 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[04:45:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:55] (PS1) Marostegui: production-m5.sql: Fix typo [puppet] - https://gerrit.wikimedia.org/r/710417
[04:54:28] (CR) Marostegui: [C: +2] production-m5.sql: Fix typo [puppet] - https://gerrit.wikimedia.org/r/710417 (owner: Marostegui)
[05:03:55] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:21] (PS1) Marostegui: dumps-eqiad-m5.sql: Grants to backup toolhub database. [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480)
[05:10:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:14:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:15:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:21:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:23:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:28:02] (PS1) Giuseppe Lavagetto: services_proxy: add keepalive for shellboxes [puppet] - https://gerrit.wikimedia.org/r/710420 (https://phabricator.wikimedia.org/T287288)
[05:31:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:01] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:49] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[05:33:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:35:21] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30501/console" [puppet] - https://gerrit.wikimedia.org/r/710420 (https://phabricator.wikimedia.org/T287288) (owner: Giuseppe Lavagetto)
[05:37:39] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01348 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[05:41:00] (CR) Giuseppe Lavagetto: [V: +1 C: +2] services_proxy: add keepalive for shellboxes [puppet] - https://gerrit.wikimedia.org/r/710420 (https://phabricator.wikimedia.org/T287288) (owner: Giuseppe Lavagetto)
[05:44:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160 T288273', diff saved to https://phabricator.wikimedia.org/P16965 and previous config saved to /var/cache/conftool/dbconfig/20210806-054433-marostegui.json
[05:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:42] T288273: Please optimize image table in commonswiki - https://phabricator.wikimedia.org/T288273
[05:45:29] (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/710422 (https://phabricator.wikimedia.org/T288273)
[05:46:11] (CR) Marostegui: [C: +2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/710422 (https://phabricator.wikimedia.org/T288273) (owner: Marostegui)
[05:47:31] !log Optimize commonswiki.image on db1160 T288273
[05:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:47] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:02:43] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:05:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:17:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:22] interesting, something similar happened yesterday morning as well right --^ ?
[06:26:03] (CR) Jcrespo: "Wouldn't this require the same change on the codfw ones?" [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[06:26:29] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[06:26:30] (CR) Marostegui: "Oh yes! Doing it - thanks!" [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[06:28:13] (PS2) Marostegui: dumps-*-m5.sql: Grants to backup toolhub database. [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480)
[06:30:53] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.0184 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[06:30:56] (CR) Jcrespo: [C: +1] "Should I deploy? or you do?" [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[06:31:13] (CR) Marostegui: "Feel free to go ahead! Much appreciated :)" [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[06:42:21] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup) Thanks <3
[06:43:37] !log Reboot db1107 to upgrade its kernel
[06:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:17] (PS1) Elukey: kubeflow: create a separate chart for its Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710481 (https://phabricator.wikimedia.org/T272919)
[06:47:27] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (fgiunchedi) @papaul thank you so much for the speedy action on this!
[06:47:44] (CR) jerkins-bot: [V: -1] kubeflow: create a separate chart for its Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710481 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[06:50:05] (PS1) Ladsgroup: Disable DPL on wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710482 (https://phabricator.wikimedia.org/T287916)
[06:51:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:53:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:17] (CR) Filippo Giunchedi: [C: +2] alertmanager: route Icinga alerts before team routes [puppet] - https://gerrit.wikimedia.org/r/710287 (owner: Filippo Giunchedi)
[06:54:44] (PS2) Elukey: kubeflow: create a separate chart for its Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710481 (https://phabricator.wikimedia.org/T272919)
[06:56:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:56:50] (PS3) Elukey: kubeflow: create a separate chart for its Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710481 (https://phabricator.wikimedia.org/T272919)
[06:57:25] (CR) Filippo Giunchedi: [V: +1 C: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30502/console" [puppet] - https://gerrit.wikimedia.org/r/709471 (owner: David Caro)
[06:58:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:59:55] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: added some wmcs team label configs and default sre [puppet] - https://gerrit.wikimedia.org/r/709471 (owner: David Caro)
[06:59:57] (CR) Filippo Giunchedi: [C: +2] profile.icinga_exporter: Added label_teams_config parameter [puppet] - https://gerrit.wikimedia.org/r/709054 (owner: David Caro)
[07:00:00] (CR) Filippo Giunchedi: [C: +2] prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - https://gerrit.wikimedia.org/r/709053 (owner: David Caro)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210806T0700)
[07:02:31] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[07:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:15] (CR) Filippo Giunchedi: [C: +1] logstash: extend ssd tier retention from 15 to 30 days [puppet] - https://gerrit.wikimedia.org/r/710341 (https://phabricator.wikimedia.org/T287938) (owner: Herron)
[07:10:01] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:46] (CR) Ema: [C: +2] pontoon: initialize new stack traffic [puppet] - https://gerrit.wikimedia.org/r/710279 (owner: Ema)
[07:12:10] (PS2) Ema: pontoon: add cptext and cpupload [puppet] - https://gerrit.wikimedia.org/r/710280
[07:13:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:18] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[07:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:26] (CR) Jcrespo: [C: +2] dumps-*-m5.sql: Grants to backup toolhub database. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/710419 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui)
[07:22:43] (PS1) Elukey: kubeflow: add pre-hook ordering for Namespace and Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710483 (https://phabricator.wikimedia.org/T272919)
[07:22:45] (Abandoned) Elukey: kubeflow: create a separate chart for its Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710481 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[07:23:02] (PS1) JMeybohm: dragonfly::dfdaemon: Notify dfdaemon on certificate change [puppet] - https://gerrit.wikimedia.org/r/710484 (https://phabricator.wikimedia.org/T286054)
[07:24:07] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30503/console" [puppet] - https://gerrit.wikimedia.org/r/710484 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[07:28:13] (CR) Elukey: [C: +2] kubeflow: add pre-hook ordering for Namespace and Secret [deployment-charts] - https://gerrit.wikimedia.org/r/710483 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[07:30:50] (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[07:31:01] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:41] (03CR) 10JMeybohm: [C: 04-1] eventgate - Disable http service if tls.enabled (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) (owner: 10Ottomata) [07:32:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:36:47] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [07:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:56] (03PS1) 10JMeybohm: appserver_dragonfly: Remove experimental stuff from appservers [puppet] - 10https://gerrit.wikimedia.org/r/710485 (https://phabricator.wikimedia.org/T286054) [07:38:43] (03PS1) 10Elukey: kubeflow: fix the cert name for the webhook TLS certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/710486 (https://phabricator.wikimedia.org/T272919) [07:41:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:42:03] (03PS1) 10JMeybohm: Clean up appserver_dragonfly test role [puppet] - 10https://gerrit.wikimedia.org/r/710487 (https://phabricator.wikimedia.org/T286054) [07:42:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30504/console" [puppet] - 10https://gerrit.wikimedia.org/r/710485 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [07:45:43] (03CR) 10Elukey: [C: 03+2] kubeflow: fix the cert name for the webhook TLS certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/710486 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [07:48:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:48:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:57] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon: Notify dfdaemon on certificate change [puppet] - 10https://gerrit.wikimedia.org/r/710484 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [07:52:04] (03PS1) 10Marostegui: production-m5.sql.erb: Add dbproxy2004 grants [puppet] - 10https://gerrit.wikimedia.org/r/710489 (https://phabricator.wikimedia.org/T288093) [07:52:46] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Add dbproxy2004 grants [puppet] - 10https://gerrit.wikimedia.org/r/710489 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [07:54:08] (03CR) 10Jgiannelos: postgresql::user: split HBA configuration into a different define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [07:55:44] (03CR) 10Jgiannelos: postgresql::user: split HBA configuration into a different define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [07:56:33] 
PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:56:35] (03CR) 10Ema: [C: 03+2] pontoon: add cptext and cpupload [puppet] - 10https://gerrit.wikimedia.org/r/710280 (owner: 10Ema) [07:58:16] (03CR) 10Jgiannelos: [C: 03+1] "Looks OK to me. I think though its blocked by https://gerrit.wikimedia.org/r/c/operations/puppet/+/709717." [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [07:58:24] !log test thanos 0.21 on thanos-fe2001 - T288326 [07:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:31] T288326: thanos compact crash during downsampling and restart on invalid checksum for large block - https://phabricator.wikimedia.org/T288326 [08:00:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:09:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:31] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:18:23] (03PS1) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [08:18:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:18:51] (03CR) 10jerkins-bot: [V: 04-1] mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [08:21:42] (03PS2) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [08:22:42] (03CR) 10MSantos: [C: 03+1] profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [08:24:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:25:16] (03PS1) 10Elukey: kubeflow: add container env variables to reach the k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/710493 (https://phabricator.wikimedia.org/T272919) [08:25:30] (03CR) 10MSantos: maps: reimage maps2005 as buster replica of maps2009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [08:26:23] (03PS2) 10Elukey: kubeflow: add container env variables to reach the k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/710493 (https://phabricator.wikimedia.org/T272919) [08:26:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, 
interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:29:17] (03CR) 10Elukey: [C: 03+2] kubeflow: add container env variables to reach the k8s api [deployment-charts] - 10https://gerrit.wikimedia.org/r/710493 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [08:30:30] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [08:30:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:31:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:45] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:35:15] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add alerting cluster puppet fail [alerts] - 10https://gerrit.wikimedia.org/r/710248 (https://phabricator.wikimedia.org/T283151) (owner: 10Filippo Giunchedi) [08:36:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:37:05] (03PS3) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [08:38:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[08:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:30] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (10fgiunchedi) 05Open→03Resolved We have an `alerting` cluster specific puppet failure alert... [08:38:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:04] (03CR) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [08:39:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:42:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:43:21] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:41] (03PS1) 10Vgutierrez: envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) [08:50:43] (03PS1) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [08:55:31] (03PS4) 10Jcrespo: 
mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [08:57:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:57:47] (03CR) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [08:59:56] (03PS5) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [09:02:24] (03PS1) 10Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) [09:05:14] (03PS1) 10David Caro: am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 [09:05:16] (03PS1) 10David Caro: global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710498 [09:05:38] (03CR) 10David Caro: "In retrospective, I should have done this at the beginning xd" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [09:06:29] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:55] (03CR) 10David Caro: [C: 04-1] am: added main function tests and small refactor (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [09:07:04] (03PS6) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [09:10:21] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:10:48] (03PS2) 10David Caro: am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 [09:10:50] (03PS2) 10David Caro: global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710498 [09:11:09] (03PS2) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [09:12:05] (03PS7) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [09:13:12] (03PS1) 10Filippo Giunchedi: grafana: unset X-CAS-uid [puppet] - 10https://gerrit.wikimedia.org/r/710499 (https://phabricator.wikimedia.org/T288286) [09:14:34] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[09:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:27] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet [09:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:49] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2005.codfw.wmnet with reason: Rebuilding as buster replica of maps1009 [09:15:51] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2005.codfw.wmnet with reason: Rebuilding as buster replica of maps1009 [09:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:18:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I can already see that this absenting will not clean up everything (as usual, for absented resources in puppet) but it can remove some cru" [puppet] - 10https://gerrit.wikimedia.org/r/710485 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:19:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "This should only be merged once the preceding patch has fully applied on the appservers, or you'll end up with a mixed situation." 
[puppet] - 10https://gerrit.wikimedia.org/r/710487 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:19:55] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] appserver_dragonfly: Remove experimental stuff from appservers [puppet] - 10https://gerrit.wikimedia.org/r/710485 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:21:06] (03CR) 10Filippo Giunchedi: mediabackups: Switch TLS certificates to PKI rather than puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [09:21:22] (03CR) 10Hnowlan: maps: reimage maps2005 as buster replica of maps2009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:21:36] (03PS2) 10Hnowlan: maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) [09:22:09] (03PS2) 10Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) [09:25:52] (03CR) 10Kormat: [C: 03+1] grafana: unset X-CAS-uid [puppet] - 10https://gerrit.wikimedia.org/r/710499 (https://phabricator.wikimedia.org/T288286) (owner: 10Filippo Giunchedi) [09:26:51] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: unset X-CAS-uid [puppet] - 10https://gerrit.wikimedia.org/r/710499 (https://phabricator.wikimedia.org/T288286) (owner: 10Filippo Giunchedi) [09:27:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:29:35] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:28] (03PS2) 10Btullis: Redirect jupyter notebook logs to logstash via kafka [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) [09:34:15] (03CR) 10Filippo Giunchedi: [C: 03+1] mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [09:37:17] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Joe) [09:38:43] (03CR) 10Kormat: "LGTM, i'll merge it shortly." [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [09:39:28] (03CR) 10Kormat: [C: 03+2] xhgui: enable database access for admins [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [09:40:14] (03PS8) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) [09:42:55] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Switch TLS certificates to PKI rather than puppet [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [09:43:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:46:19] (03CR) 10Kormat: "Change merged, and deployed to xhgui2001.codfw.wmnet,xhgui1001.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [09:48:05] (03CR) 10Btullis: "I'm CC'ing members of o11y so that they are made aware of our intention to redirect users' Jupyter notebook logs to Logstash." [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [09:48:47] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:49:45] (03PS1) 10Kormat: db1181: Disable notifications for reimage. [puppet] - 10https://gerrit.wikimedia.org/r/710501 (https://phabricator.wikimedia.org/T288244) [09:50:05] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:31] (03PS3) 10Btullis: Redirect jupyter notebook logs to logstash via kafka [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) [09:50:36] (03CR) 10Kormat: [C: 03+2] db1181: Disable notifications for reimage. 
[puppet] - 10https://gerrit.wikimedia.org/r/710501 (https://phabricator.wikimedia.org/T288244) (owner: 10Kormat) [09:51:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:52:04] (03PS1) 10Kormat: install_server: switch db1181 to buster [puppet] - 10https://gerrit.wikimedia.org/r/710502 (https://phabricator.wikimedia.org/T288244) [09:52:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:33] (03CR) 10Kormat: [C: 03+2] install_server: switch db1181 to buster [puppet] - 10https://gerrit.wikimedia.org/r/710502 (https://phabricator.wikimedia.org/T288244) (owner: 10Kormat) [09:55:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:56:06] (03PS1) 10Jcrespo: mediabackups: Make /etc/minio/ssl writeable by owner [puppet] - 10https://gerrit.wikimedia.org/r/710503 (https://phabricator.wikimedia.org/T222113) [09:56:59] (03CR) 10Filippo Giunchedi: [C: 03+1] mediabackups: Make /etc/minio/ssl writeable by owner [puppet] - 10https://gerrit.wikimedia.org/r/710503 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [09:57:16] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Make /etc/minio/ssl writeable by owner [puppet] - 10https://gerrit.wikimedia.org/r/710503 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [09:58:33] !log reimaging db1181 (s7) to buster T288244 [09:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:40] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [10:03:04] 10SRE, 10Maps (Tilerator): Externalize tile 
storage for maps - https://phabricator.wikimedia.org/T196474 (10Jgiannelos) Since we already have done some research for using swift as a vector tile storage and tegola is already running backed by swift on staging k8s should we close this ticket or there is more thi... [10:07:18] (03PS1) 10Ayounsi: Allow mgmt to reach apt.wo [homer/public] - 10https://gerrit.wikimedia.org/r/710506 (https://phabricator.wikimedia.org/T277340) [10:10:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:10:21] (03PS1) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [10:10:52] (03PS3) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [10:12:22] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:12:45] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:13:05] (03CR) 10Ayounsi: [C: 03+2] Allow mgmt to reach apt.wo [homer/public] - 10https://gerrit.wikimedia.org/r/710506 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [10:13:42] (03Merged) 10jenkins-bot: Allow mgmt to reach apt.wo [homer/public] - 10https://gerrit.wikimedia.org/r/710506 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [10:14:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE [10:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE [10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:07] (03PS1) 10Hnowlan: maps: make maps1006 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/710509 (https://phabricator.wikimedia.org/T269582) [10:24:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [10:24:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 (https://phabricator.wikimedia.org/T285309) (owner: 10Ahmon Dancy) [10:29:56] (03PS1) 10Jcrespo: Revert "prometheus: Add hosts_only=false on minio job" [puppet] - 10https://gerrit.wikimedia.org/r/710404 [10:30:06] (03PS2) 10Jcrespo: Revert "prometheus: Add hosts_only=false on minio job" [puppet] - 10https://gerrit.wikimedia.org/r/710404 [10:31:34] (03CR) 10jerkins-bot: [V: 04-1] Revert "prometheus: Add hosts_only=false on minio job" [puppet] - 10https://gerrit.wikimedia.org/r/710404 (owner: 10Jcrespo) [10:32:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [10:33:51] (03PS3) 10Jcrespo: Revert "prometheus: Add hosts_only=false on minio job" [puppet] - 10https://gerrit.wikimedia.org/r/710404 [10:34:31] (03CR) 10Jgiannelos: [C: 03+1] maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:35:34] (03PS2) 10JMeybohm: Clean up appserver_dragonfly test role [puppet] - 10https://gerrit.wikimedia.org/r/710487 
(https://phabricator.wikimedia.org/T286054) [10:37:12] (03CR) 10JMeybohm: [C: 03+2] Clean up appserver_dragonfly test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710487 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:38:58] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30509/console" [puppet] - 10https://gerrit.wikimedia.org/r/710487 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:42:23] (03PS2) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [10:43:29] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:45:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:46:45] (03PS3) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [10:46:55] (03CR) 10MSantos: [C: 03+1] maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:48:41] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:50:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:53:06] (03PS2) 10Hnowlan: maps: make maps1006 a buster 
replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/710509 (https://phabricator.wikimedia.org/T269582) [10:54:25] (03CR) 10Ayounsi: [C: 03+1] Add doh5002 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [10:56:10] (03PS2) 10Btullis: Add a CNAME entry for analytics-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709695 (https://phabricator.wikimedia.org/T273642) [10:56:12] (03CR) 10Hnowlan: [C: 03+2] maps: make maps1006 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/710509 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:57:45] (03CR) 10Jcrespo: [C: 03+2] Revert "prometheus: Add hosts_only=false on minio job" [puppet] - 10https://gerrit.wikimedia.org/r/710404 (owner: 10Jcrespo) [10:58:35] (03CR) 10Btullis: [C: 03+2] Redirect jupyter notebook logs to logstash via kafka [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [11:14:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1006.eqiad.wmnet with reason: REIMAGE [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:15:05] (03PS1) 10Ladsgroup: Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) [11:15:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:16:35] (03PS2) 10Ladsgroup: Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) [11:16:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1006.eqiad.wmnet with reason: REIMAGE [11:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:17:11] (03PS1) 10Marostegui: mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/710516 (https://phabricator.wikimedia.org/T287454) [11:17:40] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/710516 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [11:18:19] (03PS1) 10Marostegui: wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/710517 (https://phabricator.wikimedia.org/T287454) [11:18:53] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/710517 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [11:28:58] (03PS4) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [11:29:18] (03PS1) 10Ladsgroup: mediawiki: Migrate dispatching cron of 
testwikidatawiki to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/710519 (https://phabricator.wikimedia.org/T288175) [11:29:20] (03PS1) 10Ladsgroup: mediawiki: Migrate wikidatawiki dispatch crons to three systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) [11:29:30] (03PS1) 10Btullis: Exclude jupyterhub notebooks from local logging [puppet] - 10https://gerrit.wikimedia.org/r/710521 (https://phabricator.wikimedia.org/T287339) [11:31:14] (03CR) 10Btullis: [C: 03+2] Exclude jupyterhub notebooks from local logging [puppet] - 10https://gerrit.wikimedia.org/r/710521 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [11:32:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:34:28] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech, and 2 others: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Ladsgroup) a:03Ladsgroup [11:35:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:35:25] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Ensure on codfw kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/710522 (https://phabricator.wikimedia.org/T286054) [11:37:03] (03PS1) 10Cathal Mooney: Exposed Netbox interface 'type' value so it can be used in templates. 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) [11:38:34] (03CR) 10Filippo Giunchedi: [C: 03+1] global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710498 (owner: 10David Caro) [11:38:46] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:39:34] (03PS1) 10Cathal Mooney: Change to interface templates for mr routers. [homer/public] - 10https://gerrit.wikimedia.org/r/710525 (https://phabricator.wikimedia.org/T288343) [11:42:31] (03PS1) 10Marostegui: production-m5.sql: Add more grants to dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/710526 (https://phabricator.wikimedia.org/T288093) [11:42:45] (03CR) 10JMeybohm: [C: 03+2] dragonfly::dfdaemon: Ensure on codfw kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/710522 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [11:43:57] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add more grants to dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/710526 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [11:45:04] !log enabling dragonfly dfdaemon on kubernetes200* [11:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:02] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 115 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:50:12] 
RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:52:24] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:58:23] (03CR) 10Ayounsi: [C: 03+1] "LGTM! Clean implementation." [homer/public] - 10https://gerrit.wikimedia.org/r/710525 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [11:58:42] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [12:02:50] (03PS1) 10JMeybohm: Add dragonfly-peer and supernode cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/710528 (https://phabricator.wikimedia.org/T286054) [12:07:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:08:24] (03PS1) 10Jelto: remove backup warning for config backups [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/710529 (https://phabricator.wikimedia.org/T288324) [12:09:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:12:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:16:25] (03CR) 10Jelto: "After upgrading to GitLab 13.12.9 a new WARNING message is generated when doing the configuration backups. The WARNING message is hardcode" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/710529 (https://phabricator.wikimedia.org/T288324) (owner: 10Jelto) [12:18:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:20:43] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:20:44] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:09] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:21:10] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:22:30] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:22:30] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:37] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[12:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:25:16] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:25:20] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:35] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [12:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:32] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [12:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:34:27] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:12] (03PS1) 10Ayounsi: Add cloudsw2-c8-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/710534 (https://phabricator.wikimedia.org/T277340) [12:42:13] (03CR) 10Awight: "I don't know what the diffConfig CI failure is about, clicking through to the job it reports success." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) (owner: 10Awight) [12:46:25] awight: I wouldn't worry about diffConfig. The only time I've seen it work is with db lists. [12:48:33] (03CR) 10David Caro: [C: 04-1] am: added main function tests and small refactor (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [12:48:34] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:29] (03PS3) 10David Caro: am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 [12:54:31] (03PS3) 10David Caro: global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710498 [12:54:39] (03CR) 10David Caro: [C: 04-1] am: added main function tests and small refactor (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [12:56:11] !log test thanos 0.22 on thanos-fe2001 - T288326 [12:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:18] T288326: thanos compact crash during downsampling and restart on invalid checksum for large block - https://phabricator.wikimedia.org/T288326 [12:56:34] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:57:43] (03PS1) 10Kormat: Revert "db1181: Disable notifications for reimage." [puppet] - 10https://gerrit.wikimedia.org/r/710550 [12:58:37] (03CR) 10Kormat: [C: 03+2] Revert "db1181: Disable notifications for reimage." 
[puppet] - 10https://gerrit.wikimedia.org/r/710550 (owner: 10Kormat) [13:01:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:03:50] this is the lumen transport link, is there maintenance for it? [13:03:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:04:46] ah yes there is some work in progress, not scheduled [13:05:37] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:05:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:39] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.00575 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:07:18] elukey: https://phabricator.wikimedia.org/T288218 is from earlier in week [13:07:46] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[13:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:32] RhinosF1: thanks! In this case the transport link is for esams <=> eqiad, there is unexpected maintenance [13:10:42] Ah [13:11:04] Someone should probably see if that task is still needed though [13:13:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:14:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:16:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice, thank you!" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [13:24:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:29:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:33:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] (03PS1) 10Kormat: site.pp: Fix location of db1167 (s8, not s7) [puppet] - 10https://gerrit.wikimedia.org/r/710540 [13:38:57] (03CR) 10Btullis: [C: 03+2] Add a CNAME entry for analytics-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709695 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:40:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:32] marostegui: are you leaving things in puppet to see if i'm paying attention again? :P https://gerrit.wikimedia.org/r/710540, https://gerrit.wikimedia.org/r/c/operations/puppet/+/684690 [13:42:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:42:11] lots of movements! 
[13:42:35] marostegui: ಠ_ಠ [13:44:47] (03CR) 10Kormat: [C: 03+2] site.pp: Fix location of db1167 (s8, not s7) [puppet] - 10https://gerrit.wikimedia.org/r/710540 (owner: 10Kormat) [13:45:11] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Enable new VE template dialog sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709027 (https://phabricator.wikimedia.org/T286765) (owner: 10Awight) [13:46:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:50:09] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. I think the OSPF is a small detail, if we end up with a lot of such devices it might be worth a special template but as things are " [homer/public] - 10https://gerrit.wikimedia.org/r/710534 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [13:53:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:55:13] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:57:03] (03PS1) 10Kormat: mariadb: Add specific role for sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [13:57:18] (03PS3) 10Hnowlan: maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) [13:58:07] (03PS2) 10Kormat: mariadb: Add specific role for sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [13:58:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:00:33] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:01:11] (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:04:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:04:26] (03PS1) 10Hnowlan: maps: add missing postgres entry for maps2005 [puppet] - 10https://gerrit.wikimedia.org/r/710544 (https://phabricator.wikimedia.org/T269582) [14:05:18] (03PS3) 10Kormat: mariadb: Add specific role for sanitarium masters. [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) [14:06:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:07:00] (03CR) 10Hnowlan: [C: 03+2] maps: add missing postgres entry for maps2005 [puppet] - 10https://gerrit.wikimedia.org/r/710544 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:11:08] (03CR) 10Kormat: "recheck experimental" [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [14:13:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:13:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. 
- https://phabricator.wikimedia.org/T288342 (10cmooney) @ayounsi yeah you need to complete the usual IPv4 request form I believe. I'll dig into it and will run the fo... [14:13:42] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [14:15:45] hashar: does 'check experimental' still work with the new version of gerrit ^? [14:16:14] kormat: why would it not work? [14:16:33] oh, it just did [14:16:41] hashar: i couldn't find a running job for the pcc check [14:16:43] I have definitely tested that `recheck` does produce the proper event for CI to act on [14:16:47] AHH [14:16:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:16:52] so usually I just head to https://integration.wikimedia.org/zuul/ [14:17:07] and in the search input box at top enter the repo (ex: `puppet` ) [14:17:18] it is not ideal, the progress should be shown directly on the gerrit change [14:20:46] (03CR) 10Kormat: "PCC looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/710541 (https://phabricator.wikimedia.org/T285390) (owner: 10Kormat) [14:22:51] (03PS1) 10JMeybohm: Test node_labels [puppet] - 10https://gerrit.wikimedia.org/r/710566 [14:24:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2005.codfw.wmnet with reason: REIMAGE [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on maps2005.codfw.wmnet with reason: REIMAGE [14:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:12] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30512/console" [puppet] - 10https://gerrit.wikimedia.org/r/710566 (owner: 10JMeybohm) [14:28:40] (03PS4) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [14:29:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2005.codfw.wmnet with reason: Reimaging [14:29:40] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2005.codfw.wmnet with reason: Reimaging [14:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:09] (03PS5) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [14:35:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet [14:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:40:36] (03PS1) 10Ema: pontoon: add acmechief to traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710569 [14:41:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:43:50] (03PS2) 10JMeybohm: kubernetes::node: Add node.kubernetes.io/disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/710566 (https://phabricator.wikimedia.org/T288345) [14:55:08] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:56:13] (03CR) 10Ayounsi: [C: 03+2] Add cloudsw2-c8-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/710534 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [14:56:54] 10SRE-Access-Requests: Requesting access to RESOURCE for @dang - https://phabricator.wikimedia.org/T288355 (10dang) [14:57:05] (03Merged) 10jenkins-bot: Add cloudsw2-c8-eqiad to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/710534 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [14:58:08] 10SRE-Access-Requests: Requesting access to RESOURCE for @dang - https://phabricator.wikimedia.org/T288355 (10RhinosF1) [14:58:59] 10SRE-Access-Requests: Requesting access to RESOURCE for @dang - https://phabricator.wikimedia.org/T288355 (10dang) [14:59:01] 10SRE-Access-Requests: Requesting access to RESOURCE for @dang - https://phabricator.wikimedia.org/T288355 (10RhinosF1) Please leave the section marked for SRE for them to fill out. You'll need to get your manager to actually comment on task to approve. (If they don't use Phab then the person on duty can co-ordi... 
[14:59:31] 10SRE-Access-Requests: Requesting access to RESOURCE for @dang - https://phabricator.wikimedia.org/T288355 (10dang) [15:00:02] 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RhinosF1) [15:02:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:09] (03PS1) 10Ayounsi: Add cloudsw2-c8-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/710575 (https://phabricator.wikimedia.org/T277340) [15:06:00] (03CR) 10Lucas Werkmeister (WMDE): "wmf.16 is safely rolled out by now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [15:06:24] (03CR) 10Lucas Werkmeister (WMDE): "wmf.16 is safely rolled out by now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706341 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [15:07:02] (03CR) 10Ayounsi: [C: 03+2] Add cloudsw2-c8-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/710575 (https://phabricator.wikimedia.org/T277340) (owner: 10Ayounsi) [15:07:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM before wmf.17 is safely rolled out to all wikis and won’t be rolled back again." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [15:08:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10ayounsi) Last thing to do is enable the interfaces on the cloudsw1-c8 side and it will be ready to receive servers. [15:11:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:04] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 66650568 and 782 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:12:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:08] (03PS1) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [15:14:48] !log removing maps1005 from old maps cassandra cluster before reimaging [15:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:04] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:15:16] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5416 and 974 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:23:29] 
(03PS2) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [15:37:27] (03CR) 10Bstorm: "The config server is a pretty ambitious solution. Since puppet isn't really up to the task here in terms of querying openstack and all tha" [puppet] - 10https://gerrit.wikimedia.org/r/710068 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [15:44:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:46:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:46:44] (03PS1) 10Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) [15:48:23] (03PS1) 10Hnowlan: maps: reimage maps1005 as buster imposm replica [puppet] - 10https://gerrit.wikimedia.org/r/710582 (https://phabricator.wikimedia.org/T269582) [15:51:03] (03CR) 10Majavah: metricsinfra: Add config management server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710068 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [15:52:46] (03Abandoned) 10Cathal Mooney: Exposed Netbox interface 'type' value so it can be used in templates. [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [15:53:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:51] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:59:13] (03PS1) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) [16:02:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:45] (03PS2) 10Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) [16:09:04] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10Jclark-ctr) updated netbox with elastic host name serial cf_ticket elastic1068 JGZDKD3 T279158 elastic1069 JGYJKD3 T279158 elastic1070 JGZFKD3 T279158 elastic1071 JGZBKD3 T27915... [16:09:39] (03CR) 10Elukey: "Build the image locally, its size is around 600 MB, not really slim but I am not sure if we can trim it more :(" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:11:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:13:12] (03Restored) 10Cathal Mooney: Exposed Netbox interface 'type' value so it can be used in templates. 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [16:14:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:15:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) Thank you @Cmjohnson we would also be happy with getting thew servers in production now and later move a few of them in a separate action, if that isn't maki... [16:19:15] (03CR) 10Bstorm: [C: 03+2] metricsinfra: Add config management server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710068 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [16:19:39] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10Jclark-ctr) This host was rma damaged in shipping 2021-07-14T17:44:05.608117+00:00 Failure Device with s/n B9VVZB3 (N/A) not present in Netbox These remaining ones are at @pa... 
[16:23:29] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:27:40] (03PS1) 10Dzahn: site/DHCP: decom peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/710585 (https://phabricator.wikimedia.org/T288290) [16:29:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 8 days, 4:00:00 on peek2001.codfw.wmnet with reason: decom [16:29:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 4:00:00 on peek2001.codfw.wmnet with reason: decom [16:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts peek2001.codfw.wmnet [16:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:06] (03CR) 10Herron: [C: 03+2] logstash: extend ssd tier retention from 15 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/710341 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [16:34:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps1005.eqiad.wmnet with reason: Awaiting reimaging, depooled. [16:34:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps1005.eqiad.wmnet with reason: Awaiting reimaging, depooled. [16:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:37:17] (03CR) 10SBassett: [C: 03+1] site/DHCP: decom peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/710585 (https://phabricator.wikimedia.org/T288290) (owner: 10Dzahn) [16:37:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:38:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts peek2001.codfw.wmnet [16:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:59] 10SRE, 10SecTeam-Processed, 10Security: non-rw grafana does not strip CAS user header - https://phabricator.wikimedia.org/T288286 (10sbassett) [16:39:25] 10SRE, 10SecTeam-Processed, 10Security: non-rw grafana does not strip CAS user header - https://phabricator.wikimedia.org/T288286 (10sbassett) [16:41:09] (03CR) 10Dzahn: [C: 03+2] site/DHCP: decom peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/710585 (https://phabricator.wikimedia.org/T288290) (owner: 10Dzahn) [16:41:20] (03PS2) 10Dzahn: site/DHCP: decom peek2001 [puppet] - 10https://gerrit.wikimedia.org/r/710585 (https://phabricator.wikimedia.org/T288290) [16:50:23] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:51:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:29] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10Papaul) @Jclark-ctr thank you for the update Willy knows already about those. Those are line cards, we can not put asset tags on line cards. [16:53:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:53:42] (03PS1) 10Hnowlan: maps: reenable tilerator on maps2005 [puppet] - 10https://gerrit.wikimedia.org/r/710591 (https://phabricator.wikimedia.org/T269582) [16:56:06] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:56:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:59:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet [16:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:38] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:32] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 64001 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [17:10:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:12] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:33] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10wiki_willy) Hi @Jclark-ctr & @Papaul - just a heads up, if it's a linecard or something else that doesn't get an asset tag, you can just set the "AssetID" to "NA" and the "Asset... 
[17:17:31] ACKNOWLEDGEMENT - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 64001 MB (3% inode=99%): Hnowlan Cluster migration in progress https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [17:24:12] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:25:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:29:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:31:30] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:35:29] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] remove backup warning for config backups [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/710529 (https://phabricator.wikimedia.org/T288324) (owner: 10Jelto) [17:39:01] !log gitlab: run ansible to apply [[gerrit:710529|remove backup warning for config backups]] (T288324) [17:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:11] T288324: WARNING: In GitLab 14.0 we will begin removing all configuration backups older than yourgitlab_rails['backup_keep_time'] setting (currently set to: 259200) - 
https://phabricator.wikimedia.org/T288324 [17:42:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) Both iDracs are setup and they're accessible, needs f/w update and non data center specific work [17:51:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) [17:52:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [17:57:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [18:00:56] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) IDRACs setup, the on-site work is complete. 
[18:03:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) [18:06:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:10:30] (03CR) 10Ssingh: [C: 03+2] envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [18:16:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) [18:17:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) the on-site specific work has been completed [18:17:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:17:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:19:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:19:20] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:29:26] PROBLEM - Host an-worker1139 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:41:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:24] RECOVERY - Host an-worker1139 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:47:58] (03PS1) 10Legoktm: shellbox: Add new logo, by thcipriani [deployment-charts] - 10https://gerrit.wikimedia.org/r/710597 [18:52:57] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:52:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:53:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:54] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) decom script executed and servers removed from racks for mw1261-1266 rack A5 mw1269-1275... [18:58:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:04:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:12:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:36] RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [19:16:25] (PS1) Majavah: toolforge: add shells in /usr/bin to wheel_of_misfortune [puppet] - https://gerrit.wikimedia.org/r/710598 [19:17:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:21:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:17] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Cmjohnson) [19:26:01] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Cmjohnson) @Dzahn the on-site work is complete for all of the servers, I moved mw1448-1450 to rack A5. I swapped the network cable for mw1444. [19:27:00] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson) @Ottomata an-worker1139 is officially in rack A7.
All cabled up and ready for OS install [19:28:39] (PS2) Ssingh: site: switch doh5002 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/710360 [19:28:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:30:24] SRE, ops-eqiad, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Cmjohnson) [19:30:30] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson) [19:30:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:43] SRE, ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (Cmjohnson) Open→Resolved the MW servers are out of the rack, will make sure to balance power better with new servers racked in A7 [19:30:52] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:39] (CR) Legoktm: "What about php-excimer and php-wmerrors?"
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 (https://phabricator.wikimedia.org/T285309) (owner: 10Ahmon Dancy) [19:45:30] (03PS1) 10Addshore: admin: New ssh key for addshore (new laptop) [puppet] - 10https://gerrit.wikimedia.org/r/710605 [19:45:46] o/ :) [19:48:18] addshore: you should get a yubikey ;) [19:48:36] I have one, and will give it a go setting it up once i ditch my old laptop [19:48:54] but i fly to berlin monday and old laptop is going with me and staying there :P [19:49:29] do I need to do anything "fancy" to make someone merge that? :) [19:50:09] I put the same pub key in my home dir in deploy1002 [19:50:13] let's do a quick video call, one sec [19:50:16] oh [19:50:17] also fine [19:50:51] where'd you put it? [19:51:08] /home/addshore/adshwm-wmf-prodution-20210806_id_rsa.pub [19:53:17] (03CR) 10Legoktm: [C: 03+2] "Verified identify, addshore used his existing ssh key to put this one on deploy1002" [puppet] - 10https://gerrit.wikimedia.org/r/710605 (owner: 10Addshore) [19:53:35] ty, i'll test it and if i didnt mess up make another patch to remove ye olde one [19:53:56] a grad ssh key rotation feels kind of therapeutic [19:53:59] *grand [19:54:16] running puppet on all the bastions, it'll take a minute [19:57:47] addshore: try logging into a bastion now? 
[19:59:38] ya, into bast3005.wikimedia.org :) [20:00:41] (03PS1) 10Addshore: admin: Remove ssh key of old laptop for addshore [puppet] - 10https://gerrit.wikimedia.org/r/710606 [20:00:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:03] (03CR) 10Legoktm: [C: 03+2] admin: Remove ssh key of old laptop for addshore [puppet] - 10https://gerrit.wikimedia.org/r/710606 (owner: 10Addshore) [20:10:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:12:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:16:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:17:41] (03PS1) 10Legoktm: shellbox: Disable php-fpm slowlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/710607 (https://phabricator.wikimedia.org/T288315) [20:19:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:40:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:42:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:53:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:55:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:03:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:06:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:11:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:11:45] 10SRE, 10Wikimedia-Mailing-lists, 10wikimedia.biterg.io: Mailing lists statistics on wikimedia.biterg.io broken since move from pipermail to hyperkitty (non-public API) - https://phabricator.wikimedia.org/T288369 (10Aklapper) Thanks for the quick investigation and additional info; I've forwarded this to Bite... 
[21:14:44] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:27:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:29:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:08:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:21:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:22:00] (03CR) 10Jeena Huneidi: [C: 04-1] toolhub: initial chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [22:23:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:41:47] (03PS1) 10Cwhite: hiera: add observability role_contacts [puppet] - 10https://gerrit.wikimedia.org/r/710617 [22:42:27] (03CR) 10Cwhite: [C: 03+1] global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710498 (owner: 10David Caro) [22:42:54] (03CR) 10Cwhite: [C: 03+1] 
am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710497 (owner: 10David Caro) [22:43:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:47:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:51:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:52:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:55:16] (03PS1) 10Ahmon Dancy: fpm-multiversion-base: Add php-excimer and php-wmerrors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) [22:56:41] (03PS2) 10Ahmon Dancy: fpm-multiversion-base: Add php-excimer and php-wmerrors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) [23:02:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:04:12] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:35:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:37:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:43:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:44:10] (03PS4) 10BryanDavis: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) [23:45:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:45:20] (03CR) 10jerkins-bot: [V: 04-1] toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [23:50:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:56:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down