[02:09:57] (PurgedHighEventLag) firing: (4) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:14:57] (PurgedHighEventLag) firing: (19) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:19:57] (PurgedHighEventLag) firing: (15) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:24:57] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:29:57] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:34:57] (PurgedHighEventLag) firing: (18) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:39:57] (PurgedHighEventLag) firing: (21) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:44:57] (PurgedHighEventLag) firing: (14) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:49:57] (PurgedHighEventLag) firing: (20) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:54:57] (PurgedHighEventLag) firing: (15) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[02:59:57] (PurgedHighEventLag) firing: (23) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[03:04:58] (PurgedHighEventLag) firing: (23) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[03:09:57] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[03:14:58] (PurgedHighEventLag) resolved: (32) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[05:29:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (ayounsi) I also like option 5 (hard-coding the conditional in Jinja to not configure RA if the device name starts...
[07:41:57] (PurgedHighEventLag) firing: (7) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[07:46:57] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[07:51:57] (PurgedHighEventLag) firing: (18) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[07:56:57] (PurgedHighEventLag) firing: (19) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[07:59:33] !log restart purged on cp5017 as test to clear out consumer group timeouts and rejoin events
[07:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:28] looks better now
[08:01:57] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:02:01] the other purged that are alarming suffer from more-or-less the same, it seems all eqsin -> eqiad consumer group issues
[08:04:55] so same issue re appearing on 5017
[08:05:03] Consumer group session timed out (in join-state steady) after 10072 ms without a successful response from the group coordinator (broker 1001, last error was Success): revoking assignment and rejoining group
[08:05:26] there are also other alerts for eqsin
[08:06:57] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:11:57] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:13:46] (discussing the issue on #sre, seems to be network related)
[08:16:57] (PurgedHighEventLag) firing: (22) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:21:57] (PurgedHighEventLag) firing: (24) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:26:57] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[08:30:23] transport link drained, purged should recover soon-ish
[08:31:58] (PurgedHighEventLag) resolved: (32) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:21:02] folks I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/919802 this afternoon if nobody opposes (a varnishkafka cleanup, in theory a noop)
[10:27:18] elukey: looking good, thanks for the heads up
[11:31:58] (PurgedHighEventLag) firing: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:36:58] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:41:58] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:46:58] (PurgedHighEventLag) firing: (22) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:51:58] (PurgedHighEventLag) firing: (26) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:56:58] (PurgedHighEventLag) firing: (10) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:01:58] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:06:58] (PurgedHighEventLag) firing: (18) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:11:58] (PurgedHighEventLag) firing: (24) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:16:58] (PurgedHighEventLag) firing: (19) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:21:58] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:26:58] (PurgedHighEventLag) firing: (20) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:31:58] (PurgedHighEventLag) firing: (18) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:36:58] (PurgedHighEventLag) firing: (17) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:41:58] (PurgedHighEventLag) firing: (27) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:46:58] (PurgedHighEventLag) resolved: (24) High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:29:14] mmmm again purged
[13:41:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (cmooney) >>! In T337057#8869828, @ayounsi wrote: > I also like option 5 (hard-coding the conditional in Jinja to n...
[13:50:05] elukey: yeah... latency going super high
[13:50:24] take https://grafana.wikimedia.org/goto/AL0XaMQ4k?orgId=1 as an example
[13:51:57] vgutierrez: I think that top*ranks is going to depool the transport link again to mitigate the issue until we figure out what causes it
[15:16:30] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (Fabfur) After talking with @Vgutierrez these are the steps the cookbook needs to implement: 1. Depool the host 2. Set profile::cache::varnish::frontend::enable_http_redirection: false (via netbox->hiera) 3. Set profile::cache:...
[15:49:40] Traffic, Patch-For-Review: Write a cookbook to handle upgrades of ATS - https://phabricator.wikimedia.org/T335531 (BCornwall) @Vgutierrez I'm not exactly sure what the scope of "upgrading ATS" is but since there's already a "HAProxy rolling upgrade" cookbook I basically just lifted that.
[15:49:52] Traffic, Infrastructure-Foundations, Puppet-Core, SRE: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557 (jbond)
[16:34:38] Traffic, netops, Commons, Infrastructure-Foundations, WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (jbond)
[17:07:08] Traffic, Patch-For-Review: Write a cookbook to handle upgrades of ATS - https://phabricator.wikimedia.org/T335531 (BCornwall)
[18:52:34] netops, Infrastructure-Foundations, SRE: Junos: use mgmt_junos for syslog and ntp - https://phabricator.wikimedia.org/T320244 (ayounsi) Open→Resolved a:ayounsi All done where possible.
[18:52:44] netops, Infrastructure-Foundations, SRE: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi) All done where possible.
[18:58:09] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) the fasw and asw1-eqsin switches didn't create the `mgmt_junos` routing instance as they should have. https://gerrit.wikimedia.org/r/922161 works...
[18:59:47] netops, Infrastructure-Foundations, SRE: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi)
[20:24:46] netops, Infrastructure-Foundations, SRE: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) Open→Resolved a:ayounsi Going to close this task as this is as far as we can go due to the fasw switches not being easily upgraded.
[20:25:35] netops, Infrastructure-Foundations, SRE: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi) Open→Resolved a:ayounsi