[00:50:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:05:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:25:40] RESOLVED: VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:56:26] hello folks
[07:56:55] as anticipated yesterday in #security I'd like to depool cp4037 and deploy the new linux kernel, then reboot and check
[08:47:29] doing it now
[08:47:47] ok
[08:48:02] is there any reason why cp4037?
[08:48:48] we use it for benthos testing but should be no issue, atm has the same configuration as all other hosts in ulsfo
[08:49:46] fabfur: suk*he suggested that node in #security yesterday :)
[08:50:02] if you prefer another one I'll keep it in mind for the next test
[08:50:24] nope
[08:50:32] ok for me
[08:50:38] super
[08:59:28] fabfur: cp4037 is up and looks good, do you mind to quickly check before I repool?
[08:59:36] sure!
[09:00:49] LGTM!
[09:01:48] thankss
[09:08:46] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9989347 (10ayounsi) We will need to migrate the whole range to a new prefix :( Running 2 ranges is going to be a pain long term, and would n...
[09:29:15] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989379 (10ayounsi) @cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally...
[09:45:36] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9989412 (10hashar)
[09:46:15] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989413 (10cmooney) >>! In T368513#9938867, @fgiunchedi wrote: > Those are SSH probes from local prometheus hosts indeed, in t...
[11:08:15] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9989577 (10cmooney) 05Open→03Resolved
[12:18:17] <_joe_> hi everyone, I plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054081/3 now unless you see a reason not to
[12:18:41] <_joe_> this one should not cause any havoc, but I'll disable puppet everywhere but magru anyways
[12:27:22] <_joe_> I'll take the silence to mean "we're all afk you can proceed"
[12:27:34] lol
[12:32:31] ok for me, how much time do you think puppet will be disabled on cp hosts?
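For reference, the cp4037 maintenance discussed above (depool, install the new kernel, reboot, verify, repool) would follow roughly the flow sketched below. This is a hedged outline only: the depool/pool conftool wrappers, the kernel package name, and running it by hand rather than via a cookbook are all assumptions, not the exact procedure used.

    # Minimal sketch of the cp4037 flow; commands and package name are
    # assumptions for illustration, not the verified WMF procedure.
    host=cp4037.ulsfo.wmnet

    # 1. Depool the host from its services (conftool wrapper assumed present).
    ssh "$host" sudo depool

    # 2. Install the new kernel package (package name is hypothetical).
    ssh "$host" sudo apt-get install -y linux-image-amd64

    # 3. Reboot and wait for the host to come back up.
    ssh "$host" sudo reboot
    until ssh -o ConnectTimeout=5 "$host" true 2>/dev/null; do sleep 10; done

    # 4. Sanity checks before repooling: running kernel and a clean puppet run.
    ssh "$host" uname -r
    ssh "$host" sudo run-puppet-agent

    # 5. Repool once everything looks good.
    ssh "$host" sudo pool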
[12:34:53] _joe_: errr :)
[12:35:23] <_joe_> fabfur: as soon as puppet runs on one host, which seems to mean "in a few" now
[12:35:34] ack
[12:36:54] <_joe_> fabfur: but I will have to merge the other patches, let's sync
[12:39:53] <_joe_> fabfur: do you need to work with puppet or can I continue with the next patch?
[12:40:12] pls go on, I have nothing in queue
[12:48:26] <_joe_> merging the third patch, this one should add requestctl support for cache hits, let's see if it breaks anything :)
[12:48:49] 🤞
[12:49:33] * volans redirects all pages to joe
[12:49:46] volans: for now and forever?
[12:49:53] _joe_: do we currently have any rule defined to impact cache hits as well?
[12:50:07] <_joe_> vgutierrez: yes in cache-upload
[12:50:15] <_joe_> you can see it on cp7009
[12:50:19] sukhe: https://bash.toolforge.org/quip/AU7VTzhg6snAnmqnK_pc
[12:50:21] <_joe_> there's two actually
[12:50:34] vgutierrez: we have three rules I think, one of which was disabled
[12:50:39] volans: lol
[12:51:26] ack
[12:51:39] I'll keep an eye on hit-front TTFBs
[12:51:40] <_joe_> vgutierrez: the file is /etc/varnish/requestctl-filters-hit.inc.vcl
[12:52:10] <_joe_> one of the rules is a repetition of one we have in puppet heh
[12:53:18] <_joe_> cp7009 seems happy, now testing a text node then I guess we can reenable puppet
[12:54:45] <_joe_> I tend to forget how slow puppet runs are in cache pops :/
[12:54:54] really?
[12:55:10] <_joe_> maybe puppet is just slow in general
[12:55:13] I guess relatively maybe. but once you run puppet on alerting hosts, everything pales in comparison
[12:55:19] <_joe_> but yes, latency to the master is terrible for puppet perf
[12:56:54] <_joe_> is there a reasonable way to understand if varnish reloaded correctly?
[12:57:09] <_joe_> or there's just looking at the quite cryptic logs of the service?
[12:57:17] just checking the journal and we do have alerts if it does not
[12:57:29] logs && alerts :)
[12:57:57] <_joe_> ok
[12:58:16] <_joe_> so i wasn't out of sync on it 😢
[12:58:23] _joe_: yeah, we got an alert for 7008
[12:58:25] 08:58:06 <+icinga-wm> PROBLEM - Confd vcl based reload on cp7008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
[12:58:31] <_joe_> yes
[12:58:35] <_joe_> not sure why
[12:58:44] <_joe_> why it worked on upload and not there...
[12:58:52] <_joe_> un maybe a race condition?
[12:59:37] <_joe_> Jul 17 12:55:40 cp7008 varnishd[1170]: CLI telnet 127.0.0.1 51255 127.0.0.1 6082 Wr 106 No VCL named vcl-3971d473-1aba-4b4c-9bd8-5633bf553d27 known.
[12:59:39] Jul 17 12:55:40 cp7008 varnishd[1170]: CLI telnet 127.0.0.1 51255 127.0.0.1 6082 Wr 106 No VCL named vcl-3971d473-1aba-4b4c-9bd8-5633bf553d27 known.
[12:59:42] <_joe_> uhh not that clear :P
[12:59:47] <_joe_> but I think I'm right
[13:00:00] <_joe_> I should revert the last patch, make puppet run everywhere, then re-merge it
[13:00:17] <_joe_> because it didn't fail on 7009
[13:01:06] <_joe_> the solution is to ask varnish to reload the config again I think, let me try
[13:01:45] that should have been done by Puppet https://puppetboard.wikimedia.org/report/cp7008.magru.wmnet/5396bbb2a19d850eef122972950924bcf43b737d
[13:02:47] <_joe_> sukhe: I think puppet does restart varnish before one of the files is available
[13:02:56] <_joe_> we're waiting for confd to restart
[13:03:21] <_joe_> sukhe: did the alert clear out now?
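On the "is there a reasonable way to understand if varnish reloaded correctly?" question: besides the journal and the Icinga reload-vcl check quoted above, varnishadm can list the compiled VCLs directly. A rough sketch follows; the `-n frontend` instance name, the unit name, and the VCL file path are assumptions about the local setup, not verified details of these hosts.

    # Hedged sketch for verifying a VCL reload on a cache host.
    # Recent reload activity and errors (e.g. "No VCL named ... known."):
    sudo journalctl -u varnish-frontend.service --since "-30 min" | grep -i vcl

    # List compiled VCLs: exactly one should be "active", and the name the
    # last reload referenced should be present in the list.
    sudo varnishadm -n frontend vcl.list

    # If the reload raced with a varnish restart, re-issuing the load/use
    # (which is what re-running the reload, or "I just reloaded varnish"
    # below, amounts to) normally recovers; the VCL path is hypothetical.
    name="reload_$(date +%s)"
    sudo varnishadm -n frontend vcl.load "$name" /etc/varnish/wikimedia_text-frontend.vcl
    sudo varnishadm -n frontend vcl.use "$name"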
[13:03:34] yep cleared
[13:03:42] all good
[13:03:44] <_joe_> I just reloaded varnish
[13:03:52] <_joe_> so yeah if we want to avoid a flood of errors
[13:03:56] <_joe_> we should do as I said
[13:04:07] <_joe_> I'll revert, run puppet ebverywhere, re-merge
[13:08:56] <_joe_> done, confirmed this way we won't have failures, proceeding
[13:09:08] ok
[13:10:49] <_joe_> I think it's an important milestone to be able to block scrapers/hotlinking on upload
[13:12:15] yeah indeed
[13:16:12] <_joe_> failed on cp3067.esams.wmnet
[13:17:24] <_joe_> not really, whatever :P
[13:17:33] didn't seem it failed
[13:20:05] <_joe_> yep
[13:20:20] <_joe_> sukhe: I'm using cumin, we know it's not a reliable piece of software
[13:20:43] _joe_: going to sit this one out :)
[13:21:12] lol
[13:21:31] depends what are you doing
[13:21:50] <_joe_> volans: just running puppet using run-puppet-agent
[13:22:01] <_joe_> it completed successfully but cumin says otherwise
[13:22:08] <_joe_> no output to prove it either
[13:22:26] <_joe_> or maybe it's there, but when you have 112 hosts that's hard to find
[13:22:33] https://www.puppet.com/docs/puppet/8/man/agent#usage-notes
[13:22:42] sorry https://www.puppet.com/docs/puppet/7/man/agent#usage-notes
[13:23:18] <_joe_> volans: run-puppet-agent, that you wrote, handles all that
[13:23:55] <_joe_> fabfur: I'm done btw, thanks for bearing with me
[13:23:57] $ git blame ./modules/profile/files/puppet/bin/run-puppet-agent | grep -c Giuseppe
[13:24:00] 72
[13:24:03] $ git blame ./modules/profile/files/puppet/bin/run-puppet-agent | grep -c Riccardo
[13:24:06] 16
[13:24:27] <_joe_> volans: we've run that script 1000s of times
[13:24:28] and no it doesn't, it returns whatever puppet agent returns
[13:24:36] so depends on the exit code of puppet
[13:24:41] <_joe_> volans: so why on that host it failed?
[13:24:49] <_joe_> while it completed successfully everywhere else?
[13:24:56] <_joe_> and the output from puppet is the same?
[13:25:08] volans: giving new meaning to git blame
[13:25:42] <_joe_> (I would also ask why cumin doesn't highlight the output of failed runs more clearly, but I've asked that when it was created to no avail)
[13:27:06] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9990011 (10fgiunchedi) So I looked where the probes come from, and they are part of the generic "probe mgmt network hosts for...
[13:46:55] jokes aside I don't see in the logs a clear error in this case, unfortunately for cumin we don't always log also debug like with cookbooks
[13:58:36] 06Traffic, 10conftool, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9990341 (10Joe) 05Open→03Resolved
[14:21:36] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9990455 (10BCornwall) 05Stalled→03Resolved Thank you! We've achieved enough migration to consider this done, now!
[14:39:54] _joe_: BTW, TTFB on hit-front doesn't seem to be impacted so far: https://grafana.wikimedia.org/goto/WCWpOWXSg?orgId=1
[16:17:30] bblack: (or anyone else who has thoughts) follow-up idea shunting active/active traffic to failoid: what if, in the discovery services section of the wmnet zone file, I simply switch the DYNA entry to geoip!disc-failoid?
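The exit-code subtlety volans is pointing at with the puppet agent usage-notes link: with --detailed-exitcodes, a "successful" run can still exit non-zero (2 means resources changed, 4 and 6 indicate resource failures), so anything wrapping it (run-puppet-agent, cumin) has to interpret the code rather than treat any non-zero as a failed run. A small illustrative check, based on the linked documentation and not on the actual run-puppet-agent script:

    #!/bin/bash
    # Illustrative handling of puppet agent --detailed-exitcodes; this is
    # not the real run-puppet-agent logic, just the documented code meanings.
    sudo puppet agent --test --detailed-exitcodes
    rc=$?

    case "$rc" in
        0) echo "run succeeded, no changes" ;;
        2) echo "run succeeded, some resources were changed" ;;
        4) echo "run succeeded, but some resources failed" ;;
        6) echo "run succeeded, with both changes and failures" ;;
        *) echo "puppet run failed (or agent disabled / already running), rc=$rc" ;;
    esac

    # A wrapper that considers "changes are fine, failures are not" would
    # treat 0 and 2 as success and everything else as an error:
    [[ "$rc" -eq 0 || "$rc" -eq 2 ]]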
[16:17:31] (yes, I would need to add the latter to the mocks)
[16:17:56] that kind of side-steps messing with changing my a/a service to a/p, and should give me the behavior I want
[16:58:14] swfrench-wmf: not sure if that really works or not without looking at everything deeply again
[16:58:20] and heading to another meeting right now
[17:46:36] thanks, bblack: not urgent or anything :) from a quick re-read of the configs, I _think_ this would work. disc-failoid is an otherwise-normal geoip resource that we instantiate statically via puppet, so in theory I should be able to point, e.g., api-ro, at it instead of `geoip!disc-api-ro`
[20:32:08] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9992603 (10cmooney) >>! In T370164#9989347, @ayounsi wrote: > We will need to migrate the whole range to a new prefix :( Running 2 ranges is...
[21:07:26] 10netops, 06Infrastructure-Foundations, 06SRE: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366 (10cmooney) 03NEW p:05Triage→03Low
[21:08:01] 10netops, 06Infrastructure-Foundations, 06SRE: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992754 (10cmooney)
[21:08:03] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9992755 (10cmooney)
[21:21:07] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992812 (10cmooney)
[22:18:59] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992974 (10cmooney)
[22:24:51] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992984 (10cmooney)
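For the failoid idea discussed at 16:17/17:46 above, the change being floated is roughly the one sketched below. The record name, TTL, and file layout are illustrative assumptions; the real discovery zone template, mock data, and authdns deploy steps are not reproduced here.

    # Hypothetical before/after in the discovery section of the wmnet zone
    # (record name, TTL and layout shown for illustration only):
    #
    #   api-ro    300 IN DYNA geoip!disc-api-ro     ; current: normal A/A discovery
    #   api-ro    300 IN DYNA geoip!disc-failoid    ; proposed: shunt traffic to failoid
    #
    # After deploying to the authoritative servers, confirm the answer now
    # points at the failoid address from more than one resolver:
    dig +short api-ro.discovery.wmnet @ns0.wikimedia.org
    dig +short api-ro.discovery.wmnet @ns1.wikimedia.org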