[02:57:24] 10Traffic, 10Analytics-Radar, 10SRE, 10WMF-General-or-Unknown, 10Performance-Team (Radar): Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10AntiCompositeNumber) It's not just `/static`, JavaScript and CSS...
[06:58:54] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE: Beta cluster certificates have expired (September 2020) - https://phabricator.wikimedia.org/T262806 (10Aklapper)
[07:33:39] 10Traffic, 10SRE, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (10ema) >>! In T292290#7418496, @Krinkle wrote: > I've made some improvements to the by-host dash that may be of use: >
10Traffic, 10SRE, 10User-ema: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (10ema)
[09:28:40] 10Traffic, 10SRE, 10User-ema: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (10ema) p:05Triage→03Low Setting priority to low for now as these seem isolated, sporadic crashes and systemd took care of the restarts as expected so there was no production impact.
[09:29:37] _joe_: I took the liberty of adding you as a subscriber as FYI, feel free to unsubscribe ofc :) ^
[09:30:10] <_joe_> ema: probably linked to the rebalances elukey has done?
[09:30:20] <_joe_> we can safely blame him anyways I'd say
[09:33:31] so I did the main-codfw rebalances in two days, Oct 11 and 12 - https://phabricator.wikimedia.org/T288825#7416285
[09:33:45] <_joe_> elukey: and eqiad?
[09:34:36] Friday 15th and today
[09:46:33] meeting, bbiab
[10:24:56] (VarnishTrafficDrop) firing: 67% GET drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=ulsfo - https://alerts.wikimedia.org
[10:31:46] little req fluctuation in ulsfo, nothing worrisome ^
[10:49:56] (VarnishTrafficDrop) resolved: 69% GET drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=ulsfo - https://alerts.wikimedia.org
[12:46:18] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Epic: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz)
[13:59:08] topranks: XioNoX: nice work on the lvs options doc, creative solutions in there
[13:59:42] i don't know if you had a chance to discuss them after I left, but the last two options look fairly good to me?
[14:33:31] hi all, i plan to roll out a change to interface::rps tomorrow. for this i will disable puppet on all lvs, dns and cp servers (as well as some others), merge the change and test it on a canary for each role to ensure there are no interface blips. assuming all goes well i will then re-enable puppet and let the change deploy, reverting if there is a blip. I expect the entire piece of work to take
[14:33:37] about an hour to complete (should be quicker) and ...
[14:33:39] ... planned to start at ~13:00 UTC. let me know if there are any concerns etc (cc ema vgutierrez )
[14:33:42] the change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/730210
[14:34:41] that overlaps with the Traffic team weekly meeting
[14:37:48] vgutierrez: if i bump it to 14:00 would that work?
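jbond's rollout plan above boils down to an ordered sequence: disable Puppet on every affected role, merge, run and verify a canary per role, then re-enable Puppet so the remaining hosts converge on their next agent run. The sketch below is purely illustrative Python, not WMF tooling; the step strings, the `plan_rollout` helper, and the canary hostnames are all hypothetical.

```python
# Illustrative plan builder for the canary-first rollout described above.
# Step names and canary hostnames are hypothetical, not real WMF tooling.

def plan_rollout(roles, canaries):
    """Return the ordered steps: disable Puppet everywhere first, merge,
    exercise one canary per role, then re-enable so the rest of the fleet
    picks the change up on its next Puppet agent run."""
    steps = [f"disable-puppet {role}" for role in roles]
    steps.append("merge change 730210")
    for role in roles:
        steps.append(f"run-puppet {canaries[role]}")         # canary host first
        steps.append(f"verify-interfaces {canaries[role]}")  # revert on any blip
    steps.extend(f"enable-puppet {role}" for role in roles)
    return steps
```

A failure at the verify step would mean reverting the change before re-enabling Puppet anywhere, matching the "reverting if there is a blip" caveat in the plan.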
[14:38:00] jbond: that would be awesome
[14:38:12] no problem thanks 14:00 it is :)
[14:38:17] thanks
[14:56:14] traffic folks: reminder to update the SRE meeting doc, don't hate me, I am a bot!
[15:07:26] question_mark: Thanks for the feedback.
[15:07:41] We discussed the other options after you left but no particular preference was expressed.
[15:08:19] I think myself and Arzhel probably prefer the "direct connection" (2A/B) options over bridging the Vlans from the new racks into the existing rows.
[15:09:48] But those options would work well and shouldn't cause any scaling issues (mac table on existing switches etc.). So they are definitely options.
[15:39:42] 10Acme-chief, 10cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (10Vgutierrez)
[15:40:16] 10Acme-chief, 10Traffic, 10SRE, 10Patch-For-Review: Implement a watchdog mechanism on acme-chief - https://phabricator.wikimedia.org/T292619 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez
[15:43:54] 10Acme-chief, 10cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (10Vgutierrez) @dcaro I've implemented systemd's watchdog support on acme-chief. This is already running on the production instances and it shoul...
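For context on the watchdog support mentioned in T292619/T273956: a systemd service opts in with WatchdogSec= in its unit file, and the daemon must then periodically send WATCHDOG=1 datagrams over the sd_notify(3) socket named in the NOTIFY_SOCKET environment variable, conventionally every WATCHDOG_USEC/2. The following is a minimal sketch of that protocol in Python, not acme-chief's actual implementation:

```python
import os
import socket

def notify(message: bytes) -> bool:
    """Send an sd_notify(3) datagram to the socket systemd passes in
    NOTIFY_SOCKET. Returns False when not running under systemd."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):          # abstract-namespace socket
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, addr)
    return True

def watchdog_interval():
    """Recommended pet interval: half of WatchdogSec, which systemd
    exports to the service as WATCHDOG_USEC (microseconds)."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2

# In the daemon's main loop, roughly every watchdog_interval() seconds:
#     notify(b"WATCHDOG=1")
```

If the daemon wedges and stops sending WATCHDOG=1, systemd kills it and (with a suitable Restart= setting) starts a fresh instance, which is how a stuck certificate-refresh loop can recover without relying on SIGHUP handling.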
[16:20:54] 10Acme-chief, 10cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (10dcaro) \o/ thanks a lot @Vgutierrez, will try it soon(ish)
[16:21:34] 10Acme-chief, 10User-dcaro, 10cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (10dcaro)
[17:12:57] (VarnishTrafficDrop) firing: 68% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[17:17:57] (VarnishTrafficDrop) resolved: 64% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[18:26:41] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Epic: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn) p:05Triage→03High
[18:38:52] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Epic: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn) p:05High→03Medium
[18:43:47] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Dzahn)
[20:21:40] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @Papaul Could we schedule a firmware upgrade for gerrit2001 due to this issue? (not high prio)
[20:23:43] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Papaul) @Dzahn sure we can.
[20:30:18] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @cmooney Thank you very much for all the debugging effort you put into this and thanks @Papaul for confirming...
[20:32:57] Hey traffic, for some reason "wmf_auto_restart_systemd-timesyncd.service" failed on dns1001, dns4001 and dns5002 all a couple hours ago.
[20:34:36] "not present or not running". maybe you don't want it but a unit is not fully removed
[20:48:33] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10ops-codfw: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Papaul) @Dzahn I will go for turning this into a tracking ticket for firmware upgrades with check boxes of affected...
[21:43:46] mutante: thanks!
[21:43:59] yep! :)
[21:44:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/730852 seems to be the trigger, although I'm not sure yet if this managed to override the timesyncd ensure param the wrong way, or it's just the lack of a $ensure on the auto_restart bit
[21:45:01] I think the latter
[21:50:12] bblack: so it's NOT supposed to be on dns servers, right? yea, no $ensure passed to auto_restart part but default is present
[21:50:27] unit does not exist but it tries to restart a nonexistent unit
[21:51:32] yeah something like that. I put up a hypothetical fix, but let's let jbond and others have a look at it first
[21:52:24] (the "dnsbox" servers all use legacy ntpd because they're the internal servers to all the other machines that are timesyncd clients)
[21:53:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/731844
[21:53:22] compares now to your fix, same thing I guess
[21:54:18] indeed :). yea, and I left a comment on John's change, I'm sure he will fix it asap
[21:54:33] let me clean up Icinga until then
[21:58:57] mutante: sorry, didn't realize you had a patch going too!
[22:00:12] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH) p:05Triage→03Medium
[22:00:20] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH)
[22:00:29] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH)
[22:06:04] 10netops, 10Infrastructure-Foundations, 10ops-eqiad: Patch Telxius transport cross-connect to cr1-eqiad - https://phabricator.wikimedia.org/T293709 (10RobH)
[22:13:14] bblack: no worries, just a minute difference or so
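To restate the wmf_auto_restart failure discussed at 20:32–21:50: the restart wrapper fired for systemd-timesyncd on the dnsboxes, where that unit is never installed (they run legacy ntpd), because no $ensure was passed through to the auto_restart resource and Puppet's default is present. The actual fix is on the Puppet side, but the missing guard can be illustrated with a hypothetical check that a unit file exists before attempting a restart; the `unit_exists` helper below is an illustration, not the real wrapper's code.

```python
import subprocess

def unit_exists(unit, list_output=None):
    """Return True when `unit` appears in `systemctl list-unit-files`
    output. Pass `list_output` directly for testing; otherwise the
    command is run. A restart wrapper guarded this way would skip the
    "not present or not running" failure instead of erroring on a unit
    that was never installed."""
    if list_output is None:
        list_output = subprocess.run(
            ["systemctl", "list-unit-files", unit],
            capture_output=True, text=True, check=False,
        ).stdout
    # The unit file name is the first column of each listing line.
    return any(line.split()[:1] == [unit] for line in list_output.splitlines())
```

This mirrors the consensus in the log ("unit does not exist but it tries to restart a nonexistent unit"): either the guard or the propagated $ensure makes the wrapper a no-op on hosts where the unit is absent.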