[00:13:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:15:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:17:49] (PS3) BryanDavis: toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[00:17:51] (PS1) BryanDavis: toolhub: Generate README.md with helm-docs [deployment-charts] - https://gerrit.wikimedia.org/r/715146
[00:22:37] (CR) BryanDavis: [C: +2] toolhub: Generate README.md with helm-docs [deployment-charts] - https://gerrit.wikimedia.org/r/715146 (owner: BryanDavis)
[00:25:44] (PS1) Legoktm: Revert "Update backbone.js and underscore.js" [extensions/PageTriage] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715087 (https://phabricator.wikimedia.org/T289825)
[00:25:49] (CR) Legoktm: [C: +2] Revert "Update backbone.js and underscore.js" [extensions/PageTriage] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715087 (https://phabricator.wikimedia.org/T289825) (owner: Legoktm)
[00:25:55] (Merged) jenkins-bot: toolhub: Generate README.md with helm-docs [deployment-charts] - https://gerrit.wikimedia.org/r/715146 (owner: BryanDavis)
[00:28:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:30:44] (CR) jerkins-bot: [V: -1] Revert "Update backbone.js and underscore.js" [extensions/PageTriage] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715087 (https://phabricator.wikimedia.org/T289825) (owner: Legoktm)
[00:32:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:40:40] (CR) Legoktm: [V: +2 C: +2] Revert "Update backbone.js and underscore.js" [extensions/PageTriage] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715087 (https://phabricator.wikimedia.org/T289825) (owner: Legoktm)
[00:44:14] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/PageTriage/: Revert backbone.js and underscore.js updates (T289825) (duration: 01m 06s)
[00:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:19] T289825: NPP Feed broken - https://phabricator.wikimedia.org/T289825
[00:44:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:15] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:54:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:05:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:07:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:10:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:14:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:18:15] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04965 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[01:20:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:22:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:27:35] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2184 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[01:34:47] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[01:36:35] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[01:39:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:40:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:48:13] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.0219 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[01:56:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:58:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:07:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:11:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:17:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:19:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:24:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:30:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:47:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:48:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:49:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:51:51] (CR) Cwhite: wmflib: add 'aka' to Service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714965 (owner: Filippo Giunchedi)
[02:52:45] (CR) Cwhite: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/715032 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[02:54:10] (CR) Cwhite: [C: +1] prometheus: remove alerts moved to AM [puppet] - https://gerrit.wikimedia.org/r/715033 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[02:59:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:03:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:29:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:37:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:58:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:10:29] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:21:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:27:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:27:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:36:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:38:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:48:51] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:09:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:00] SRE, Continuous-Integration-Infrastructure, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), Patch-For-Review, Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) >>! In T209149#730632...
[05:24:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:41] (PS2) Volans: wmcs.wikireplicas.add_wiki: rename [cookbooks] - https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465)
[05:42:59] (CR) jerkins-bot: [V: -1] wmcs.wikireplicas.add_wiki: rename [cookbooks] - https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: Volans)
[05:49:06] (PS6) Ryan Kemper: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: Gehel)
[05:51:31] (CR) jerkins-bot: [V: -1] Elasticsearch cookbooks: Represent ops as enum [cookbooks] - https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: Gehel)
[05:57:36] ryankemper: FYI the CI failures are due to the new prospector 1.4.0 that was released yesterday, I'll send a fix shortly
[05:57:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:01:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:03:41] (PS7) Ryan Kemper: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: Gehel)
[06:04:24] volans: ah that explains it :)
[06:04:27] cool
[06:06:09] (CR) jerkins-bot: [V: -1] Elasticsearch cookbooks: Represent ops as enum [cookbooks] - https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: Gehel)
[06:13:09] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:18:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:21:01] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:22:52] (PS1) Volans: Fix newly reported pylint errors [cookbooks] - https://gerrit.wikimedia.org/r/715156
[06:23:23] ryankemper: ^^^ (the wmcs one will implicitly be resolved by gerrit 714798 that I'm about to merge)
[06:23:55] ack, thanks
[06:24:23] I'll rebase yours once done
[06:25:22] (CR) jerkins-bot: [V: -1] Fix newly reported pylint errors [cookbooks] - https://gerrit.wikimedia.org/r/715156 (owner: Volans)
[06:26:43] (CR) Volans: [V: +2 C: +2] "Overriding CI as the only failure is on a subtree that will be removed in Ib9b7c80143eeac7ca402ab88b7da231bf0983c2f in few minutes." [cookbooks] - https://gerrit.wikimedia.org/r/715156 (owner: Volans)
[06:27:00] (PS2) Volans: Fix newly reported pylint errors [cookbooks] - https://gerrit.wikimedia.org/r/715156
[06:27:04] (CR) Volans: [V: +2 C: +2] Fix newly reported pylint errors [cookbooks] - https://gerrit.wikimedia.org/r/715156 (owner: Volans)
[06:27:59] (PS3) Volans: wmcs.wikireplicas.add_wiki: rename [cookbooks] - https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465)
[06:29:10] (CR) Volans: [V: +2 C: +2] "CI is failing because the new prospector released yesterday has a new pylint that is reporting some new issue in the wmcs/ subtree. As it " [cookbooks] - https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: Volans)
[06:29:38] (PS2) Volans: wmcs: remove wmcs/ subtree [cookbooks] - https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465)
[06:29:44] (PS1) Tim Starling: sendMail.php improvements [extensions/SecurePoll] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715089
[06:29:51] (CR) Volans: [C: +2] admin: update sudo rule for renamed cookbook [puppet] - https://gerrit.wikimedia.org/r/714799 (https://phabricator.wikimedia.org/T287465) (owner: Volans)
[06:30:26] (CR) Tim Starling: [C: +2] sendMail.php improvements [extensions/SecurePoll] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715089 (owner: Tim Starling)
[06:33:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:27] (CR) Volans: [C: +2] wmcs: remove wmcs/ subtree [cookbooks] - https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465) (owner: Volans)
[06:35:21] (Merged) jenkins-bot: sendMail.php improvements [extensions/SecurePoll] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715089 (owner: Tim Starling)
[06:37:08] (Merged) jenkins-bot: wmcs: remove wmcs/ subtree [cookbooks] - https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465) (owner: Volans)
[06:37:58] (PS8) Volans: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: Gehel)
[06:40:51] ryankemper: all yours, rebased and passing CI :D
[06:41:11] I did revert the fix for the pylint error as I had already fixed it in the previous patch
[06:41:43] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (Volans) The above patches have all been merged and deployed. The add_wiki cookbook is now available as `sre.w...
[06:41:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:48] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/SecurePoll/cli/wm-scripts/sendMail.php: (no justification provided) (duration: 00m 56s)
[06:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:29] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.4018 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[06:53:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:55:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:56:37] the logstash alarm seems to be related to logstash1008, udp traffic errors
[06:56:48] and it is going down
[06:56:57] already happened some hours ago for a longer period of time
[06:57:30] we should improve the wikitech links on these alarms to a more actionable set of runbook entries
[06:58:49] godog: around??
[06:59:46] elukey: yeah, usually a restart takes care of it, I'll do it
[06:59:53] can't wait to get rid of elk5
[06:59:57] godog: ah ok! <3
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210827T0700)
[07:00:59] !log bounce logstash on logstash1008
[07:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:00] also agreed re: better runbook links
[07:06:27] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[07:09:42] (PS1) Filippo Giunchedi: logstash: more specific link to udp packet loss runbook [puppet] - https://gerrit.wikimedia.org/r/715192
[07:09:45] elukey: ^
[07:10:29] (PS1) Legoktm: Fix $wgShellboxUrls for Score [mediawiki-config] - https://gerrit.wikimedia.org/r/715193
[07:10:33] (CR) Elukey: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/715192 (owner: Filippo Giunchedi)
[07:12:15] (CR) Filippo Giunchedi: [C: +2] logstash: more specific link to udp packet loss runbook [puppet] - https://gerrit.wikimedia.org/r/715192 (owner: Filippo Giunchedi)
[07:13:40] (PS1) Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - https://gerrit.wikimedia.org/r/715194
[07:14:15] (Abandoned) Legoktm: Drop $wmgUseScoreShellbox, redundant [mediawiki-config] - https://gerrit.wikimedia.org/r/713727 (owner: Legoktm)
[07:19:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:03] (CR) Filippo Giunchedi: [C: +2] prometheus: remove alerts moved to AM [puppet] - https://gerrit.wikimedia.org/r/715033 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[07:21:10] (CR) Filippo Giunchedi: [C: +2] o11y: add prometheus alerts [alerts] - https://gerrit.wikimedia.org/r/715032 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[07:21:19] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:21:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:28:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:30] (CR) Vgutierrez: varnish: Containerize varnish test environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[07:35:21] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Modify homer/automation templates to support 100BaseTX interfaces with autoneg disabled. - https://phabricator.wikimedia.org/T288343 (cmooney) p:Triage→Low
[07:35:29] SRE, Infrastructure-Foundations, netops: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (cmooney) p:Triage→Medium
[07:37:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:50] (PS1) Volans: Address newly reported pylint issues [software/spicerack] - https://gerrit.wikimedia.org/r/715197
[07:38:35] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (cmooney) p:Triage→Low
[07:38:45] SRE, Infrastructure-Foundations, netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (cmooney) p:Triage→Medium
[07:40:07] (PS1) Filippo Giunchedi: Fix dnspython 2 compat [debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714)
[07:43:16] (CR) jerkins-bot: [V: -1] Address newly reported pylint issues [software/spicerack] - https://gerrit.wikimedia.org/r/715197 (owner: Volans)
[07:43:27] (PS2) Filippo Giunchedi: Fix dnspython 2 compat [debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714)
[07:46:43] (CR) Cathal Mooney: [C: +2] "Merging to update repo." [software/homer/deploy] - https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) (owner: Cathal Mooney)
[07:46:51] (CR) Cathal Mooney: [V: +2 C: +2] Exposed Netbox interface 'type' value so it can be used in templates. [software/homer/deploy] - https://gerrit.wikimedia.org/r/710523 (https://phabricator.wikimedia.org/T288343) (owner: Cathal Mooney)
[07:47:24] (CR) Gehel: "minor comments inline. Thanks for keeping our code up to date!" [software/spicerack] - https://gerrit.wikimedia.org/r/715197 (owner: Volans)
[07:48:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:00] !log stopped kube-apiserver on kubestage2001 for testing
[07:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:10] !log stopped kube-apiserver on kubestagemaster2001 for testing
[07:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:46] Puppet, Infrastructure-Foundations, User-jbond: should we move $site global to a fact - https://phabricator.wikimedia.org/T289678 (fgiunchedi) I like the idea of the datatype and having one list of sites we consider valid! I made the case on the review for why sticking with site is important, report...
[08:00:34] SRE, Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (jcrespo) Hey, @JAllemandou, I thought you meant a mailman list, which for the most part are self-managed by each list administration by each mailing list owner. However, I cannot see any list (eve...
[08:00:56] SRE, Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (jcrespo) p:Triage→High a:jcrespo
[08:01:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:03:21] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:03:51] SRE, Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (jcrespo) It is indeed a mail server alias, so I will be able to help you :-). I will see if there is any other cleanup needed, and check & update offboarding procedures to avoid this from recurrin...
[08:04:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:05:09] (PS1) Volans: Fix newly reported pylint issues [software/cumin] - https://gerrit.wikimedia.org/r/715202
[08:05:12] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/SecurePoll/cli/wm-scripts/sendMail.php: (no justification provided) (duration: 00m 56s)
[08:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:11:24] (CR) Filippo Giunchedi: "A good starting point, though we'll be losing information as it stands, see inline" [puppet] - https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: Cwhite)
[08:11:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:17:59] SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (fgiunchedi) Thank you @Dzahn and @jcrespo, I agree short term a) or b) sound good to me and likely the way to go. Perhap...
[08:23:19] (CR) Gehel: "question inline (I'm mostly just curious)" [software/cumin] - https://gerrit.wikimedia.org/r/715202 (owner: Volans)
[08:24:05] (CR) Volans: "Thanks for the feedback! Replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/715197 (owner: Volans)
[08:24:08] SRE, Gerrit, GitLab, Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (jcrespo) d) will require T244840 @MoritzMuehlenhoff or @Volans will know how far ahead the work is there.
[08:27:48] (CR) Gehel: Address newly reported pylint issues (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/715197 (owner: Volans)
[08:35:12] (PS1) Jcrespo: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) [puppet] - https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783)
[08:35:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:41] (PS2) Jcrespo: admin: Add SimoneThisDot to the list of ldap-only-users (wmf) [puppet] - https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783)
[08:37:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:39:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:41:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:47:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:49:33] (CR) Ladsgroup: [C: +1] "Looks good. A nitpick on the commit msg." [mediawiki-config] - https://gerrit.wikimedia.org/r/715193 (owner: Legoktm)
[08:51:15] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:53:13] (PS1) Cathal Mooney: Updated build for buster and bullseye to integrate change exposing interface type from Netbox to Homer. [software/homer/deploy] - https://gerrit.wikimedia.org/r/715205 (https://phabricator.wikimedia.org/T288343)
[08:55:24] (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/715205 (https://phabricator.wikimedia.org/T288343) (owner: Cathal Mooney)
[08:56:33] (PS2) Legoktm: Don't set default $wgShellboxUrls to Score [mediawiki-config] - https://gerrit.wikimedia.org/r/715193
[08:56:50] (CR) Legoktm: Don't set default $wgShellboxUrls to Score (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715193 (owner: Legoktm)
[08:58:45] (CR) Kormat: [C: +2] mariadb: add section to alert name [puppet] - https://gerrit.wikimedia.org/r/715043 (owner: Volans)
[08:58:47] (CR) Ladsgroup: [C: +1] Don't set default $wgShellboxUrls to Score (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715193 (owner: Legoktm)
[08:58:50] (CR) Kormat: [C: +2] "<3" [puppet] - https://gerrit.wikimedia.org/r/715043 (owner: Volans)
[09:00:49] (CR) Alexandros Kosiaris: wmflib: add 'aka' to Service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714965 (owner: Filippo Giunchedi)
[09:08:18] (CR) Cathal Mooney: [C: +2] "Merging changes." [software/homer/deploy] - https://gerrit.wikimedia.org/r/715205 (https://phabricator.wikimedia.org/T288343) (owner: Cathal Mooney)
[09:08:21] (CR) Cathal Mooney: [V: +2 C: +2] Updated build for buster and bullseye to integrate change exposing interface type from Netbox to Homer. [software/homer/deploy] - https://gerrit.wikimedia.org/r/715205 (https://phabricator.wikimedia.org/T288343) (owner: Cathal Mooney)
[09:12:00] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: Jcrespo)
[09:12:49] (CR) Jcrespo: [C: +2] "Thank you very much, Jbond!" [puppet] - https://gerrit.wikimedia.org/r/715204 (https://phabricator.wikimedia.org/T289783) (owner: Jcrespo)
[09:14:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:15:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:19:18] SRE-Access-Requests, Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (AbbanWMDE)
[09:21:56] !log cmooney@deploy1002 Started deploy [homer/deploy@8183056]: Homer update exposing interface type from Netbox - T288343
[09:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:02] T288343: Modify homer/automation templates to support 100BaseTX interfaces with autoneg disabled.
- https://phabricator.wikimedia.org/T288343 [09:23:24] !log cmooney@deploy1002 Finished deploy [homer/deploy@8183056]: Homer update exposing interface type from Netbox - T288343 (duration: 01m 28s) [09:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:03] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Update to expose int type from Netbox - cmooney@cumin1001 [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Update to expose int type from Netbox - cmooney@cumin1001 [09:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [09:31:22] (03CR) 10Cathal Mooney: [C: 03+2] Change to interface templates for mr routers. [homer/public] - 10https://gerrit.wikimedia.org/r/710525 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [09:32:02] (03Merged) 10jenkins-bot: Change to interface templates for mr routers. 
[homer/public] - 10https://gerrit.wikimedia.org/r/710525 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [09:32:34] (03PS18) 10JMeybohm: kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) [09:33:42] (03CR) 10JMeybohm: kubernetes::node: Make use of the disk_type fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [09:33:45] !log Running homer against mr1-ulsfo to force OOB interface to 100Mb/full-duplex - T288343 [09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:50] T288343: Modify homer/automation templates to support 100BaseTX interfaces with autoneg disabled. - https://phabricator.wikimedia.org/T288343 [09:36:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:37:58] (03PS2) 10Volans: Address newly reported pylint issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/715197 [09:39:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:13] (03PS1) 10Elukey: roles::ores: move celery and cache to rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/715209 [09:41:00] effie: --^ [09:42:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30888/console" [puppet] - 10https://gerrit.wikimedia.org/r/715209 (owner: 10Elukey) [09:42:54] elukey: shall I merge? 
[09:43:29] effie: I preferred a sanity check +1 [09:43:46] if all is ok I'll merge + restart ores [09:43:49] (in codfw) [09:43:53] so you'll be free to go [09:45:09] (03CR) 10Klausman: [C: 03+1] roles::ores: move celery and cache to rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/715209 (owner: 10Elukey) [09:45:28] (03CR) 10Elukey: [V: 03+1 C: 03+2] roles::ores: move celery and cache to rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/715209 (owner: 10Elukey) [09:46:26] elukey ok ! [09:46:49] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1001/30887/" [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [09:48:10] (03PS2) 10Filippo Giunchedi: wmflib: add 'aliases' to Service [puppet] - 10https://gerrit.wikimedia.org/r/714965 [09:48:12] (03PS2) 10Filippo Giunchedi: hieradata: add aliases for a few services [puppet] - 10https://gerrit.wikimedia.org/r/714966 [09:48:14] (03PS2) 10Filippo Giunchedi: pontoon: extend service_names to include aliases [puppet] - 10https://gerrit.wikimedia.org/r/714968 [09:49:44] !log restart ores uwsgi/celery workers to failover rdb2007 to rdb2008 (and ease the reboot of rdb2007 [09:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:57] (03CR) 10Filippo Giunchedi: "Fair enough! 
Thanks for the comments, I've changed aka to aliases" [puppet] - 10https://gerrit.wikimedia.org/r/714965 (owner: 10Filippo Giunchedi) [09:54:19] (03CR) 10Volans: Address newly reported pylint issues (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/715197 (owner: 10Volans) [09:54:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:35] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:01:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:04:28] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) https://en.wikipedia.org/wiki/File:Logo_of_the_International_Practical_Shooting_Confe... 
[10:05:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:45] (03PS2) 10Volans: Fix newly reported pylint issues [software/cumin] - 10https://gerrit.wikimedia.org/r/715202 [10:12:44] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [10:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:14:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:00] (03PS1) 10Elukey: Revert "roles::ores: move celery and cache to rdb2008" [puppet] - 10https://gerrit.wikimedia.org/r/715091 [10:18:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [10:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:27] (03CR) 10Elukey: [C: 03+2] Revert "roles::ores: move celery and cache to rdb2008" [puppet] - 10https://gerrit.wikimedia.org/r/715091 (owner: 10Elukey) [10:22:41] !log fallback codfw ores to rdb2007 after maintenance [10:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:25] (03PS1) 10Cathal Mooney: Fixed error in Jinja2 template for hardcoded speed/duplex on mr routers. 
[homer/public] - 10https://gerrit.wikimedia.org/r/715211 (https://phabricator.wikimedia.org/T288343) [10:25:41] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1081 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash%23UDP_packet_loss https://grafana.wikimedia.org/dashboard/db/logstash [10:29:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:43] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/715211 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:30:33] (03CR) 10Cathal Mooney: [C: 03+2] Fixed error in Jinja2 template for hardcoded speed/duplex on mr routers. [homer/public] - 10https://gerrit.wikimedia.org/r/715211 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:31:09] (03Merged) 10jenkins-bot: Fixed error in Jinja2 template for hardcoded speed/duplex on mr routers. [homer/public] - 10https://gerrit.wikimedia.org/r/715211 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:31:25] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2615 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash%23UDP_packet_loss https://grafana.wikimedia.org/dashboard/db/logstash [10:31:50] !log bounce logstash on logstash1007 [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:23] (03CR) 10Klausman: [C: 03+1] Revert "roles::ores: move celery and cache to rdb2008" [puppet] - 10https://gerrit.wikimedia.org/r/715091 (owner: 10Elukey) [10:37:01] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash%23UDP_packet_loss https://grafana.wikimedia.org/dashboard/db/logstash [10:38:16] (03PS1) 10Cathal Mooney: Missing 'm' in speed command causing JunOS error. 
[homer/public] - 10https://gerrit.wikimedia.org/r/715214 (https://phabricator.wikimedia.org/T288343) [10:38:41] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/715214 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:38:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:39:22] (03CR) 10Cathal Mooney: [C: 03+2] Missing 'm' in speed command causing JunOS error. [homer/public] - 10https://gerrit.wikimedia.org/r/715214 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:40:07] (03Merged) 10jenkins-bot: Missing 'm' in speed command causing JunOS error. [homer/public] - 10https://gerrit.wikimedia.org/r/715214 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [10:43:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:44:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jbond) 05Resolved→03Open While rolling out the lldp factupdate i noticed an some machines have ip_forwarding enabled. this is likle... 
[10:45:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:48:04] (03PS1) 10Kosta Harlan: stretch-sssd: Add openssh-client [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) [10:50:01] (03PS1) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 [10:51:40] (03PS1) 10Jbond: profile::sysctl: add ability to control ip_forward: [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) [10:53:10] (03CR) 10jerkins-bot: [V: 04-1] profile::sysctl: add ability to control ip_forward: [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:54:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30889/console" [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:54:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:42] !log sudo cumin 'mw*' 'ip ro ls dev docker0 && sysctl net.ipv4.ip_forward=0' to clear up the docker remnants of the dragonfly evaluation. 
T286054 [10:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:48] T286054: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 [10:57:54] (03PS7) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [11:03:07] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:03:30] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jcrespo) 05Open→03Resolved The grants have been deployed, https://ldap.toolforge.org/user/simone-this-dot @SimoneThisDot you should have now acceess to logstash, please test it and reope... [11:06:04] (03PS2) 10Jbond: profile::sysctl: add ability to control ip_forward: [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) [11:07:05] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30890/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:08:30] (03CR) 10Hnowlan: [V: 03+1] postgresql::user: split HBA configuration into a different define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:09:33] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10jcrespo) [11:15:57] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10jcrespo) a:03odimitrijevic 
The intention here is to provide UNIX filesystem access to the analytics servers, by the user being present on the righ... [11:16:18] 10SRE, 10Maps, 10Product-Data-Infrastructure (Backlog): Maps postgres read replicas throws errors on eqiad - https://phabricator.wikimedia.org/T289852 (10Jgiannelos) [11:18:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:14] 10SRE, 10Maps, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Data-Infrastructure (Backlog): Maps postgres read replicas throws errors on eqiad - https://phabricator.wikimedia.org/T289852 (10hnowlan) a:03hnowlan [11:20:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:25:02] (03CR) 10Jbond: [C: 04-1] "Ip forwarding is often enabled at runtime by daemons that need it e.g. docker/kubelet. This can mean that the puppet config and the actua" [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [11:25:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:28:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:26] (03PS1) 10Dzahn: create a generic class to clean the puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) [11:38:57] (03PS37) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [11:39:31] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) >>! In T165885#7311068, @elukey wrote: > @jbond @Dzahn I got bitten by this problem in production 2/3 times as well (tod... [11:47:14] (03PS4) 10Dzahn: static-bugzilla: add uncompressed HTML for the first 100 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T281538) [11:47:46] (03CR) 10Jbond: "LGTM to nits" [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:50:20] (03PS1) 10Cathal Mooney: Adding stanza for 'gigether-options' to explicitly deactivate autonegotiation on MR/SRX interfaces manually set to 100Mb/full. [homer/public] - 10https://gerrit.wikimedia.org/r/715223 (https://phabricator.wikimedia.org/T288343) [11:50:55] (03CR) 10Cathal Mooney: [C: 03+2] Adding stanza for 'gigether-options' to explicitly deactivate autonegotiation on MR/SRX interfaces manually set to 100Mb/full. [homer/public] - 10https://gerrit.wikimedia.org/r/715223 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [11:51:26] (03Merged) 10jenkins-bot: Adding stanza for 'gigether-options' to explicitly deactivate autonegotiation on MR/SRX interfaces manually set to 100Mb/full. 
[homer/public] - 10https://gerrit.wikimedia.org/r/715223 (https://phabricator.wikimedia.org/T288343) (owner: 10Cathal Mooney) [11:51:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:53:05] (03CR) 10Jgiannelos: [C: 03+1] maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 (owner: 10MSantos) [11:53:32] (03CR) 10Dzahn: create a generic class to clean the puppet client bucket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [11:53:34] (03PS1) 10Urbanecm: Add some missing edit*protected rights to $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715224 [11:53:45] (03PS2) 10Dzahn: create a generic class to clean the puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) [11:55:38] (03CR) 10Dzahn: [C: 03+2] static-bugzilla: add uncompressed HTML for the first 100 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [11:56:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Modify homer/automation templates to support 100BaseTX interfaces with autoneg disabled. - https://phabricator.wikimedia.org/T288343 (10cmooney) Sorted it eventually :) ` cmooney@mr1-ulsfo> show interfaces ge-0/0/0... [11:56:49] (03Merged) 10jenkins-bot: static-bugzilla: add uncompressed HTML for the first 100 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/714460 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [11:57:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Modify homer/automation templates to support 100BaseTX interfaces with autoneg disabled. 
- https://phabricator.wikimedia.org/T288343 (10cmooney) 05Open→03Resolved [11:57:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:58:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:13] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:04:31] !log removing peering to Wave Division Holdings / AS11404 at Equinix Chicago cr2-eqord, AS no longer on exchange. [12:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:33] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30892/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:18:27] (03PS38) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [12:18:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:25] (03PS1) 10JMeybohm: Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/stretch-wikimedia) - 
10https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) [12:36:04] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10jbond) no issue with the change, however @elukey looking at an-launcher1002 there is 22GB of space free, if the filebucket is g... [12:36:54] (03CR) 10JMeybohm: [C: 03+2] kubernetes::node: Make use of the disk_type fact [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [12:40:17] (03PS1) 10Jbond: P:configmaster: don't backup known-hosts and fingerprint files [puppet] - 10https://gerrit.wikimedia.org/r/715228 (https://phabricator.wikimedia.org/T165885) [12:40:53] (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [12:41:00] (03PS6) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [12:41:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30893/console" [puppet] - 10https://gerrit.wikimedia.org/r/715228 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [12:43:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:45:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:45:56] (03PS1) 10Jbond: P:dns::auth::config: use backup false for 
GeoIP2-City.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/715230 (https://phabricator.wikimedia.org/T165885) [12:46:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:configmaster: don't backup known-hosts and fingerprint files [puppet] - 10https://gerrit.wikimedia.org/r/715228 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [12:46:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30894/console" [puppet] - 10https://gerrit.wikimedia.org/r/715230 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [12:48:41] (03PS2) 10JMeybohm: Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) [12:49:56] !log rsynced /srv/org/wikimedia/racktables from miscweb1002 to miscweb2002 (T269746) [12:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:01] T269746: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 [12:50:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:51:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:dns::auth::config: use backup false for GeoIP2-City.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/715230 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [12:53:12] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10jbond) I have fixed the issues on authdns and puppetmaster [12:53:35] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:54:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:46] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10jbond) @elukey next time you see the issue on an-launcher1002 can you run the two lines used above (for authdns and puppetmaste... [12:59:33] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:01:36] (03PS1) 10Dzahn: racktables: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/715231 [13:03:35] (03PS1) 10Mvolz: Update Zotero to c4d40f374d2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/715232 [13:05:17] (03PS1) 10Dzahn: racktables: remove work around for missing install in codfw [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) [13:06:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/715197 (owner: 10Volans) [13:06:25] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:06:45] (03CR) 10Dzahn: "[miscweb2002:~] $ file /srv/org/wikimedia/racktables/wwwroot/inc/auth.php" [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) (owner: 10Dzahn) [13:06:47] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:07:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/715202 (owner: 10Volans) [13:07:30] (03PS2) 10Dzahn: racktables: remove work around for missing install in codfw [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) [13:09:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:11:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:12:23] (03CR) 10Jbond: kubernetes::node: Make use of the disk_type fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714962 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [13:13:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/715231 (owner: 10Dzahn) [13:13:17] ACKNOWLEDGEMENT - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 571 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map Luca Toscano T267714 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:17] ACKNOWLEDGEMENT - Host ripe-atlas-codfw is DOWN: PING CRITICAL - Packet loss = 100% Luca Toscano T267714 [13:13:17] ACKNOWLEDGEMENT - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 497 probes of 497 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map Luca Toscano T267714 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:17] ACKNOWLEDGEMENT - Host ripe-atlas-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:201:208:80:152:244) Luca Toscano T267714 [13:13:19] (03CR) 10Jcrespo: [C: 03+1] "This seems reasonable to me, but please ping DBAs on the ticket (Manuel, Kormat), as they will want to be aware of misc db usage changes/e" [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) (owner: 10Dzahn) [13:14:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) (owner: 10Dzahn) [13:14:29] (03CR) 10Dzahn: [C: 03+2] racktables: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/715231 (owner: 10Dzahn) [13:16:27] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10fgiunchedi) [13:17:33] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10fgiunchedi) [13:17:42] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [13:19:00] (03CR) 10Dzahn: [C: 03+2] racktables: remove work around for missing install in codfw [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) (owner: 10Dzahn) [13:19:06] (03PS3) 10Dzahn: racktables: remove work around for missing install in codfw [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) [13:22:00] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10elukey) @jbond sure! 
[13:25:42] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10nskaggs) p:05Triage→03Low [13:27:05] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: should we move $site global to a fact - https://phabricator.wikimedia.org/T289678 (10jbond) Thanks @fgiunchedi i had net to go over thoses comments and update, and from the comments it was agreeaded to use site which im happy)ish) with. for posperity io... [13:27:47] (03CR) 10Dzahn: [C: 03+2] racktables: remove work around for missing install in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715233 (https://phabricator.wikimedia.org/T269746) (owner: 10Dzahn) [13:32:34] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7311810, @Jelto wrote: > There is a ClusterRole named `deploy` already for the aggregation of `view` and `pods/portForward` permissions. So I would prefer using... [13:35:16] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (10nskaggs) @Volans, as far as I can tell, I can no longer use this cookbook. The sudo rule no longer seems to work. 
[13:39:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:39:42] (03PS1) 10Dzahn: miscweb: bump staging version to 2021-08-27-115701-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715236 (https://phabricator.wikimedia.org/T281538) [13:40:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:43:05] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-08-27-115701-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715236 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [13:45:38] (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-08-27-115701-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715236 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [13:46:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:46:48] (03PS1) 10Dzahn: miscweb: bump production version to 2021-08-27-115701-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) [13:48:21] (03PS2) 10Jbond: lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) [13:48:24] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[13:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:12] (03CR) 10jerkins-bot: [V: 04-1] lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [13:52:52] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 560638672408 and 247308 seconds Hnowlan Awaiting resync from master. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:52:52] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 511612629592 and 193307 seconds Hnowlan Awaiting resync from master. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:52:52] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 511226751280 and 193311 seconds Hnowlan Awaiting resync from master. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:52:52] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 510914279496 and 193326 seconds Hnowlan Awaiting resync from master. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:52:52] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 510914279496 and 193332 seconds Hnowlan Awaiting resync from master. 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:00:21] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:21] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10jcrespo) I ran the following: ` $ mwscript purgeChangedFiles.php --wiki=commonswiki --starttime=2021051112254... [14:09:34] (03PS1) 10Alexandros Kosiaris: Bump memory resources for CI by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/715240 (https://phabricator.wikimedia.org/T289737) [14:15:09] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:15:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump memory resources for CI by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/715240 (https://phabricator.wikimedia.org/T289737) (owner: 10Alexandros Kosiaris) [14:16:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:18:17] (03Merged) 10jenkins-bot: Bump memory resources for CI by 2x [deployment-charts] - 10https://gerrit.wikimedia.org/r/715240 (https://phabricator.wikimedia.org/T289737) (owner: 10Alexandros Kosiaris) [14:18:46] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10jcrespo) I'm silly, the issue was on swift, not on cache: ` curl -I http://ms-fe.svc.eqiad.wmnet/wikipedia/com... [14:19:02] (03PS3) 10Jbond: lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) [14:19:04] (03PS1) 10Jbond: lldp: update lldp_parent to use lldp['parent] [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) [14:19:50] (03CR) 10jerkins-bot: [V: 04-1] lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:19:59] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) @MarkTraceur @CBogen @Tgr can someone investigate https://upload.wikimedia.org/wikipe... 
[14:20:09] (03CR) 10jerkins-bot: [V: 04-1] lldp: update lldp_parent to use lldp['parent] [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:22:29] (03PS4) 10Jbond: lldp fact: add new parent key to lldp [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) [14:23:25] (03PS2) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 [14:24:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:25:14] (03PS3) 10MSantos: maps: add wikidata polygon table and script fixes [puppet] - 10https://gerrit.wikimedia.org/r/715216 [14:30:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master [14:30:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master [14:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:44] (03PS1) 10Alexandros Kosiaris: Specify cpu too. 
Fixup for 6da39d51d061ea5017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/715244 (https://phabricator.wikimedia.org/T289737) [14:33:57] (03PS1) 10Andrew Bogott: cloud-cumin: exclude trove project from cumin runs [puppet] - 10https://gerrit.wikimedia.org/r/715245 (https://phabricator.wikimedia.org/T289706) [14:37:21] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [14:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:38:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Specify cpu too. Fixup for 6da39d51d061ea5017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/715244 (https://phabricator.wikimedia.org/T289737) (owner: 10Alexandros Kosiaris) [14:38:11] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet [14:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:37] (03Merged) 10jenkins-bot: Specify cpu too. Fixup for 6da39d51d061ea5017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/715244 (https://phabricator.wikimedia.org/T289737) (owner: 10Alexandros Kosiaris) [14:41:27] RECOVERY - Long running screen/tmux on maps2004 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:43:57] (03CR) 10Jbond: "Thanks for the PS and lgtm however it wont work as expected and may cause confusion." [puppet] - 10https://gerrit.wikimedia.org/r/713615 (owner: 10MVernon) [14:44:07] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. 
[14:44:08] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [14:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:16] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [14:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [14:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:51] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/714862 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:44:58] 10SRE, 10Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (10JAllemandou) Thanks a lot @jcrespo :) [14:45:00] (03PS2) 10Jbond: lldp: update lldp_parent to use lldp['parent] [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) [14:45:19] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:06] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:31] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [14:50:56] !log stop flink on staging cluster to verify some IOPS starvation issues [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:35] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:54:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:55:28] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [14:55:31] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:57:23] (03PS2) 10MSantos: maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 [14:57:47] (03CR) 10MSantos: maps: bump kartotherian PG query timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711555 (owner: 10MSantos) [14:57:50] (03CR) 10jerkins-bot: [V: 04-1] maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 (owner: 10MSantos) [14:58:12] 
(03PS3) 10MSantos: maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 [14:59:21] (03Abandoned) 10MSantos: maps: restore tilerator cpu ratio to 0.3 [puppet] - 10https://gerrit.wikimedia.org/r/711554 (owner: 10MSantos) [15:07:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:31] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) Papaul rebooted yesterday with the USB key present. Things //appeared// to go ok, on the serial console the device went into a Linux... [15:13:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:18:19] (03CR) 10ZPapierski: [C: 03+1] flink-session-cluster: Add support for elastic ECS logger [deployment-charts] - 10https://gerrit.wikimedia.org/r/714997 (https://phabricator.wikimedia.org/T289275) (owner: 10DCausse) [15:18:23] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:20:19] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:27:57] 10SRE, 10Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (10jcrespo) I've deployed the following change, things should apply in the following 30 minutes or so. 
`lang=diff --- a/modules/privateexim/files/wikimedia.org +++ b/modules/privateexim/files/wikimed... [15:29:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:29:55] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:34:49] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:35:41] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:36:14] 10SRE, 10Analytics: Remove fdans from analytics-alerts mailing list - https://phabricator.wikimedia.org/T289807 (10jcrespo) 05Open→03Resolved I've updated https://wikitech.wikimedia.org/w/index.php?title=SRE_Offboarding&type=revision&diff=1923454&oldid=1903359 to reflect reality- some changes were needed f... [15:37:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:38:24] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner crashed: TypeError: unsupported operand type(s) for +: 'NoneType' and 'datetime.timedelta' - https://phabricator.wikimedia.org/T288880 (10jcrespo) p:05Triage→03Low I am going to guess this falls into the "followups with low prio unless it happens aga... 
[15:39:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:37] 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10jcrespo) Commenting as I think @ayounsi will not have been CCed on the original Phab report, for him to triage. [15:46:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:48:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:01] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) 05Open→03Stalled [15:50:32] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) [15:51:38] (03PS1) 10Andrew Bogott: Switch redis-tools to 'present' rather than 'latest' on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/715253 (https://phabricator.wikimedia.org/T289867) [15:53:13] (03PS2) 10Andrew Bogott: Switch redis-tools to 'present' rather than 'latest' on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/715253 (https://phabricator.wikimedia.org/T289867) [15:54:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:55:10] (03CR) 10Majavah: [C: 04-1] "Stretch is oldoldstable at this point, we certainly want the package for newer (buster and bullseye) distros too. One option too would be " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) (owner: 10Kosta Harlan) [15:56:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:00:29] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:02:23] (03CR) 10Nskaggs: [C: 04-1] "Unfortunately, I don't believe this will fix things as the libc dependency won't be met by stretch backports. We also don't want to inadve" [puppet] - 10https://gerrit.wikimedia.org/r/715253 (https://phabricator.wikimedia.org/T289867) (owner: 10Andrew Bogott) [16:03:51] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10JAllemandou) p:05High→03Medium [16:05:27] (03CR) 10Nskaggs: [C: 03+1] "I misunderstood your intent here, I'm sorry. Yes, this should make puppet pass again. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/715253 (https://phabricator.wikimedia.org/T289867) (owner: 10Andrew Bogott) [16:06:09] (03CR) 10Andrew Bogott: [C: 03+2] Switch redis-tools to 'present' rather than 'latest' on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/715253 (https://phabricator.wikimedia.org/T289867) (owner: 10Andrew Bogott) [16:13:51] PROBLEM - Disk space on maps2009 is CRITICAL: DISK CRITICAL - free space: / 2709 MB (3% inode=98%): /tmp 2709 MB (3% inode=98%): /var/tmp 2709 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2009&var-datasource=codfw+prometheus/ops [16:13:53] (03CR) 1020after4: [V: 03+2 C: 03+2] selenium: Update README.md file (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237) (owner: 10Sahilgrewalhere) [16:14:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:23:31] 10SRE, 10DNS, 10Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10Ijon) 05Resolved→03Open (I'm assuming it's easier to re-open this for further related requests. If that's not the case, let me know!) Can you please add the following CNAMEs to the learn.wiki DN... 
[16:24:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:26:47] (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [16:27:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:31:39] (03PS1) 10Jcrespo: admin: Update aprover of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 [16:32:05] (03PS2) 10Jcrespo: admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 [16:34:39] RECOVERY - Disk space on maps2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2009&var-datasource=codfw+prometheus/ops [16:39:41] 10SRE, 10SRE-swift-storage, 10Thumbor: Thumbnails for PDF files on jv.wikisource.org show a HTTP 401 Unauthorized error - https://phabricator.wikimedia.org/T289860 (10AntiCompositeNumber) Headers: ` HTTP/2 401 Unauthorized date: Fri, 27 Aug 2021 15:09:48 GMT content-type: text/html; charset=UTF-8 content-len... [16:41:55] 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10cmooney) I had a look at this this morning (didn't catch the page when it fired and it cleared quickly as you say). Seems to be... 
[16:42:07] (03PS1) 10Ssingh: durum: update results page and remove redundant code [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) [16:42:15] 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10cmooney) p:05Triage→03Low [16:43:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:45:06] (03PS2) 10Ssingh: durum: update results page and remove redundant code [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) [16:45:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:46:29] (03PS1) 10Jbond: P:prometheus::ops: add cfssl_jobs to actually scrap metricts [puppet] - 10https://gerrit.wikimedia.org/r/715262 [16:46:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master [16:46:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps1005.eqiad.wmnet with reason: Resyncing from master [16:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:34] (03CR) 10Brennen Bearnes: [C: 03+1] aptrepo: Add gitlab-runner repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [16:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:02] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/30897/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) 
[16:53:01] (03PS2) 10Jbond: P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) [16:56:06] (03PS1) 10Jgreen: Remove payments1008 from monitoring while we reconfigure it. [puppet] - 10https://gerrit.wikimedia.org/r/715263 [16:56:23] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:56:59] (03PS3) 10Jbond: P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) [16:57:28] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [16:58:17] (03CR) 10Jgreen: [C: 03+2] Remove payments1008 from monitoring while we reconfigure it. [puppet] - 10https://gerrit.wikimedia.org/r/715263 (owner: 10Jgreen) [16:58:35] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30899/console" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:04:04] (03PS4) 10Jbond: P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) [17:04:06] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:04:35] (03CR) 10jerkins-bot: [V: 04-1] P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 
10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:05:29] (03CR) 10Cwhite: P:prometheus::ops: add cfssl_jobs to actually scrap metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:05:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30901/console" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:07:06] (03CR) 10Jbond: [V: 03+1] "retest" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:07:11] (03CR) 10Jbond: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:08:15] (03PS5) 10Jbond: P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) [17:08:31] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:11:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:11:57] (03PS39) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [17:12:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:17:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:18:26] 10SRE, 10DNS, 10Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10Asaf) And also: stage.learn.wiki studio.stage.learn.wiki preview.stage.learn.wiki CNAME = http://wkm-stage-alb-1830818829.us-east-1.elb.amazonaws.com [17:21:20] RECOVERY - Long running screen/tmux on maps2007 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:21:35] (03PS1) 10Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) [17:23:47] (03CR) 10Herron: [C: 03+1] "nit: s/scrap/scrape" [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:26:26] (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: add cfssl_jobs to actually scrap metrics [puppet] - 10https://gerrit.wikimedia.org/r/715262 (https://phabricator.wikimedia.org/T286339) (owner: 10Jbond) [17:28:22] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:29:01] (03CR) 10Jelto: "@janis could you take a look? 
This should represent the renaming of view RBAC resources as discussed in https://phabricator.wikimedia.org/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [17:39:23] (03PS2) 10Jelto: helmfile.d admin rename view rbac resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715266 (https://phabricator.wikimedia.org/T251305) [17:39:31] (03PS9) 10Ryan Kemper: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [17:40:45] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10AntiCompositeNumber) fwiw this issue was recently reported on https://commons.wikimedia.org/wiki/File:Anshu_Dik... [17:44:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:45:25] (03PS2) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) [17:49:36] (03CR) 10Volans: [C: 03+1] "LGTM syntactically :)" [puppet] - 10https://gerrit.wikimedia.org/r/715245 (https://phabricator.wikimedia.org/T289706) (owner: 10Andrew Bogott) [17:51:14] (03CR) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [17:53:44] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:42] (03PS2) 10Dduvall: aptrepo: Add gitlab-runner repo mirror [puppet] - 
10https://gerrit.wikimedia.org/r/715134 (https://phabricator.wikimedia.org/T287504) [17:54:44] (03PS12) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [17:59:12] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:06] (03PS1) 10Volans: admin: fix typo in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/715271 (https://phabricator.wikimedia.org/T287465) [18:02:37] (03CR) 10RLazarus: [C: 03+1] admin: fix typo in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/715271 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [18:03:42] (03CR) 10Volans: [C: 03+2] admin: fix typo in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/715271 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [18:04:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:05:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:59] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (10Volans) @nskaggs sorry for the trouble, there was a typo in the puppet patch, it should be fixed now. 
[18:09:42] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:54] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:12:22] (03CR) 10Volans: [C: 03+2] "Merging to allow any follow up patch to not be blocked by CI. I'd be happy to address any additional comment that might arrive later on." [software/spicerack] - 10https://gerrit.wikimedia.org/r/715197 (owner: 10Volans) [18:13:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:35] (03CR) 10Ryan Kemper: [C: 03+2] Elasticsearch cookbooks: Represent ops as enum [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [18:16:25] (03CR) 10Volans: [C: 03+2] Fix newly reported pylint issues [software/cumin] - 10https://gerrit.wikimedia.org/r/715202 (owner: 10Volans) [18:17:10] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 101, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:24] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 136, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:18:29] (03Merged) 10jenkins-bot: Address newly reported pylint issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/715197 (owner: 10Volans) [18:20:29] (03CR) 10Andrew Bogott: [C: 03+2] cloud-cumin: exclude trove project from cumin runs [puppet] - 10https://gerrit.wikimedia.org/r/715245 (https://phabricator.wikimedia.org/T289706) (owner: 10Andrew Bogott) [18:22:32] (03Merged) 10jenkins-bot: Fix newly reported pylint issues [software/cumin] - 10https://gerrit.wikimedia.org/r/715202 (owner: 10Volans) 
[18:23:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:23:14] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:28:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:29:06] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:42:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:45:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:48] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:56:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:59:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:51] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) [19:00:46] (03CR) 10Jdlrobson: Correctly enable Vector language switcher treatment A/B test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700905 (https://phabricator.wikimedia.org/T269093) (owner: 10Phuedx) [19:02:05] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10RobH) [19:02:38] (03CR) 10Jdlrobson: Enable new Vector Languages-in-header feature & AB test for pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700705 (https://phabricator.wikimedia.org/T269093) (owner: 10Jdrewniak) 
[19:03:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (10faidon) [19:04:12] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10faidon) 05Open→03Stalled There are some ongoing conversations with the WMCS team regarding the placement of their infrastructure in our network/infrastructure, and I think it would be good to... [19:05:33] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10wiki_willy) [19:07:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:18:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:24:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:31:51] (03CR) 10Legoktm: [C: 03+1] "LGTM, one unrelated suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [19:35:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:36:58] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:38:52] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:42:13] (03PS4) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [19:42:39] (03CR) 10Jdlrobson: "Will it be possible to deploy this next week Urbanecm? I'm to familiar with the branching process." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [19:52:24] (03CR) 10Urbanecm: [C: 04-1] Enable NearbyPages on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [19:53:54] (03CR) 10Cwhite: [C: 03+1] wmflib: add 'aliases' to Service [puppet] - 10https://gerrit.wikimedia.org/r/714965 (owner: 10Filippo Giunchedi) [19:54:22] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:56:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:00] (03PS2) 10Kosta Harlan: bullseye-sssd: Add openssh-client [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/715215 (https://phabricator.wikimedia.org/T258841) [19:59:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:59:54] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:01:17] (03PS3) 10Ssingh: durum: update results page and remove redundant code [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) [20:01:41] (03CR) 10Ssingh: "Thanks, updated!" [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [20:05:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:06:00] (03CR) 10Ssingh: [C: 03+2] durum: update results page and remove redundant code [puppet] - 10https://gerrit.wikimedia.org/r/715260 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [20:10:12] (03CR) 10Jdlrobson: Enable NearbyPages on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [20:11:51] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Niharika) >>! In T288844#7296789, @Huji wrote: > My understanding is that the changes in the data are minimal from one version to the n... [20:13:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:16:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:18:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:25:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:32:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:43:25] (03PS1) 10Ssingh: durum: notify the uWSGI service for app file change [puppet] - 10https://gerrit.wikimedia.org/r/715278 [20:45:38] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/30903/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/715278 (owner: 10Ssingh) [20:45:40] (03CR) 10Ssingh: [C: 03+2] durum: notify the uWSGI service for app file change [puppet] - 10https://gerrit.wikimedia.org/r/715278 (owner: 10Ssingh) [20:56:20] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:59:02] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[21:06:36] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:08:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:09:33] ^ still not sure what's causing httpbb to time out inconsistently in eqiad, I assume it's a cold cache but haven't nailed it down [21:09:48] going to roll it back and only keep the hourly test in codfw until I can get it sorted out [21:23:02] PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100% [21:31:42] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10RobH) [21:32:04] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10RobH) [21:32:18] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10RobH) [21:32:53] 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10I18n, 10Patch-For-Review: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (10TheDJ) @fgiunchedi you know if that patch makes sense ? [21:34:07] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10RobH) a:03Jclark-ctr [21:42:25] (03PS1) 10RLazarus: Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver." 
[puppet] - 10https://gerrit.wikimedia.org/r/715094 [21:42:59] (03PS2) 10RLazarus: Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver." [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) [21:43:05] (03CR) 10jerkins-bot: [V: 04-1] Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver." [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [21:43:32] (03CR) 10jerkins-bot: [V: 04-1] Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver." [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [21:44:29] (03PS3) 10RLazarus: Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver." [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) [21:44:32] :/ [21:44:47] rzl: what is the timeout set to? [21:44:54] legoktm: ten seconds [21:45:27] I've fired a bunch of traffic at it and I haven't been able to get it to take longer than 250 ms, which is part of why I suspect cold cache [21:45:42] yeah, that seems reasonable [21:47:21] (oops except that revert isn't right at all, I want to absent the systemd timer not remove it) [21:51:42] (03PS4) 10RLazarus: hieradata: Remove hourly httpbb run on cumin1001. 
[puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) [21:52:33] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30904/console" [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [21:54:14] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:01:42] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:02:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:04:41] (03PS5) 10RLazarus: hieradata: Remove hourly httpbb run on cumin1001. [puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) [22:05:58] (03CR) 10RLazarus: [C: 03+2] hieradata: Remove hourly httpbb run on cumin1001. 
[puppet] - 10https://gerrit.wikimedia.org/r/715094 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [22:08:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:10:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:11] (03PS1) 10Ssingh: durum: bikeshedding CSS fixes [puppet] - 10https://gerrit.wikimedia.org/r/715285 (https://phabricator.wikimedia.org/T289536) [22:13:11] legoktm: do you happen to know offhand how to cleanly remove a systemd timer? setting ensure => absent in systemd::timer::job just left it in place, in state `not-found failed failed` :/ [22:13:16] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:51] that's what I did in the past, try a manual `systemctl reset-failed` or `systemctl daemon-reload` ? 
[22:14:09] (03CR) 10BryanDavis: toolhub: Add helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [22:14:13] ahh, reset-failed was the incantation, thanks [22:14:19] wonder if we can get Puppet to do that [22:15:12] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:16] (03PS4) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [22:19:18] (03PS1) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [22:20:29] 10SRE, 10serviceops, 10Patch-For-Review: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10Legoktm) Just some unsorted thoughts: * Can we set the timeout to 120s (the MW request timeout) to see how long the request is actually taking, and whether cold caches is a reasonable thing to blam... [22:22:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:26:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:26:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:29:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:36] (03CR) 10Ssingh: [C: 03+2] durum: bikeshedding CSS fixes [puppet] - 10https://gerrit.wikimedia.org/r/715285 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [22:30:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:31:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:46:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:50:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:54:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:08:38] 10SRE, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Legoktm) 05Open→03Resolved [23:08:55] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) [23:09:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10Legoktm) 05Open→03Resolved I think this is all done now, woot! [23:09:40] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (10Legoktm) [23:09:44] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) [23:11:21] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Legoktm) @fgiunchedi do you have any pointers on what switching to encrypted rsync entails? Is it just a puppet setting somewhere? 
[23:11:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:17:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:40:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:51:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:55:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down