[06:38:14] netops, Infrastructure-Foundations, SRE, ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (ayounsi)
[07:18:07] Traffic, SRE, Performance-Team (Radar), User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (ema) Caches have now filled up. Response start looks good on cp3060 [[https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?viewPanel=5&orgId=1&var-host=cp3060...
[09:01:42] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (cmooney) a:cmooney→None Sorry @Dzahn I should have updated it before now. Makes sense to re-assign to DC-Ops...
[09:10:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (ayounsi) Example before/after for Telia in eqiad: `lines=20 ayounsi@re0.cr2-eqiad> show route advertising-protocol bgp 80.239.132.225 inet.0: 852341 des...
[09:15:02] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (ayounsi) Confirmed with Telia's looking glass: https://lg.twelve99.net/?type=bgp&router=prs-b6&address=185.71.138.0/24
[12:30:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (ayounsi) Open→Resolved a:ayounsi A good baseline has now been applied across most of our transits. Further tuning will happen when sub-optimal...
[14:27:04] hi all i had a meeting which overran and then had to eat some food, as such a bit late starting the change i proposed yesterday but will be starting now (https://gerrit.wikimedia.org/r/c/operations/puppet/+/662688)
[14:27:22] jbond: ack
[14:32:37] jbond: rolling varnish upgrades are in progress so if you see very little traffic on some nodes for a few minutes do not necessarily immediately worry
[14:33:09] ema: ack thanks
[14:33:18] the upgrade procedure has nothing to do with puppet so there should be no issue with your work I think
[14:33:25] do you want me to wait? i can postpone until tomorrow
[14:33:30] ack
[14:34:31] so yeah, no need to postpone
[14:59:41] bblack, ema: was there any update on the LVS options for the eqiad expansion following your team meeting yesterday?
[15:17:29] bblack: i have tested on cp1075, lvs4007, authdns2001 (as well as a ms-be, mc and wcqs host). i didn't see any drop in tcpdump or anything in syslog or dmesg when the change was applied. is there anything you want to test before i re-enable puppet?
[15:18:09] topranks: question_mark is going to talk with faidon/willy and check if 2B is feasible or not and anything else (basically, if that option's a go or no-go)
[15:20:31] jbond: seems reasonable to just let them go, if it never executed the setter variant of the ethtool and blipped traffic so far.
[15:21:02] (but maybe don't force them all at once, so if there are a few edge cases, at least they naturally splay!)
[15:22:08] bblack: ack i'll re-enable site by site over the next few hours
[15:22:21] of course there will also be the 30 minute puppet splay
[15:28:17] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki) {F34697825}
[15:33:28] bblack: thanks for the update, let's see what comes back from that.
[16:29:17] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[16:29:46] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[16:30:20] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[17:19:56] (VarnishTrafficDrop) firing: 65% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[17:24:56] (VarnishTrafficDrop) resolved: 67% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[18:05:46] fyi the interface-rpy change should be completely rolled out and all hosts should have puppet enabled again
[18:07:45] *interface-rps
[20:34:35] Traffic, SRE, observability, cloud-services-team (Kanban): cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (Dzahn)
[20:36:42] Traffic, Discovery-Search, SRE, observability: cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (Dzahn)
[20:38:10] Traffic, Discovery-Search, SRE, observability: cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (Dzahn) p:Triage→Low ` 20:36 <+icinga-wm> ACKNOWLEDGEMENT - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRIT...
[20:51:08] Traffic, Discovery-Search, SRE, observability: cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (Dzahn) additional issue is these alerts are flapping and keep coming back so a simple ACK wasn't enough. downtiming for a day for now
[20:54:56] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) affected hosts I am ACKing right now in Icinga: contint2001.mgmt ms-fe200...
[20:56:52] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[21:36:31] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) |ores2005.mgmt|PER430| |gerrit2001.mgmt|PER430| |ms-fe2006.mgmt|PER430| |...
[21:39:35] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[23:28:07] Acme-chief, SRE, Patch-For-Review: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (Dzahn) I was wondering the same but assume this should stay open until acme_chief 3.0 has been deployed?