[00:50:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:05:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:25:40] RESOLVED: VarnishHighThreadCount: Varnish's thread count on cp5023:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:56:26] hello folks
[07:56:55] as anticipated yesterday in #security I'd like to depool cp4037 and deploy the new linux kernel, then reboot and check
[08:47:29] doing it now
[08:47:47] ok
[08:48:02] is there any reason why cp4037?
[08:48:48] we use it for benthos testing but should be no issue, atm has the same configuration as all other hosts in ulsfo
[08:49:46] fabfur: suk*he suggested that node in #security yesterday :)
[08:50:02] if you prefer another one I'll keep it in mind for the next test
[08:50:24] nope
[08:50:32] ok for me
[08:50:38] super
[08:59:28] fabfur: cp4037 is up and looks good, do you mind to quickly check before I repool?
[08:59:36] sure!
[09:00:49] LGTM!
[09:01:48] thankss
[09:08:46] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9989347 (10ayounsi) We will need to migrate the whole range to a new prefix :( Running 2 ranges is going to be a pain long term, and would n...
[09:29:15] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989379 (10ayounsi) @cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally...
[09:45:36] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9989412 (10hashar)
[09:46:15] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9989413 (10cmooney) >>! In T368513#9938867, @fgiunchedi wrote: > Those are SSH probes from local prometheus hosts indeed, in t...
[11:08:15] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9989577 (10cmooney) 05Open→03Resolved
[12:18:17] <_joe_> hi everyone, I plan to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054081/3 now unless you see a reason not to
[12:18:41] <_joe_> this one should not cause any havoc, but I'll disable puppet everywhere but magru anyways
[12:27:22] <_joe_> I'll take the silence to mean "we're all afk you can proceed"
[12:27:34] lol
[12:32:31] ok for me, how much time do you think puppet will be disabled on cp hosts?
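For reference, the cp4037 maintenance discussed above (depool, install the new kernel, reboot, verify, repool) would follow roughly the flow sketched below. This is a hedged outline only: the depool/pool conftool wrappers, the kernel package name, and running it by hand rather than via a cookbook are all assumptions, not the exact procedure used.

    # Minimal sketch of the cp4037 flow; commands and package name are
    # assumptions for illustration, not the verified WMF procedure.
    host=cp4037.ulsfo.wmnet

    # 1. Depool the host from its services (conftool wrapper assumed present).
    ssh "$host" sudo depool

    # 2. Install the new kernel package (package name is hypothetical).
    ssh "$host" sudo apt-get install -y linux-image-amd64

    # 3. Reboot and wait for the host to come back up.
    ssh "$host" sudo reboot
    until ssh -o ConnectTimeout=5 "$host" true 2>/dev/null; do sleep 10; done

    # 4. Sanity checks before repooling: running kernel and a clean puppet run.
    ssh "$host" uname -r
    ssh "$host" sudo run-puppet-agent

    # 5. Repool once everything looks good.
    ssh "$host" sudo pool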
[12:34:53] _joe_: errr :)
[12:35:23] <_joe_> fabfur: as soon as puppet runs on one host, which seems to mean "in a few" now
[12:35:34] ack
[12:36:54] <_joe_> fabfur: but I will have to merge the other patches, let's sync
[12:39:53] <_joe_> fabfur: do you need to work with puppet or can I continue with the next patch?
[12:40:12] pls go on, I have nothing in queue
[12:48:26] <_joe_> merging the third patch, this one should add requestctl support for cache hits, let's see if it breaks anything :)
[12:48:49] 🤞
[12:49:33] * volans redirects all pages to joe
[12:49:46] volans: for now and forever?
[12:49:53] _joe_: do we currently have any rule defined to impact cache hits as well?
[12:50:07] <_joe_> vgutierrez: yes in cache-upload
[12:50:15] <_joe_> you can see it on cp7009
[12:50:19] sukhe: https://bash.toolforge.org/quip/AU7VTzhg6snAnmqnK_pc
[12:50:21] <_joe_> there's two actually
[12:50:34] vgutierrez: we have three rules I think, one of which was disabled
[12:50:39] volans: lol
[12:51:26] ack
[12:51:39] I'll keep an eye on hit-front TTFBs
[12:51:40] <_joe_> vgutierrez: the file is /etc/varnish/requestctl-filters-hit.inc.vcl
[12:52:10] <_joe_> one of the rules is a repetition of one we have in puppet heh
[12:53:18] <_joe_> cp7009 seems happy, now testing a text node then I guess we can reenable puppet
[12:54:45] <_joe_> I tend to forget how slow puppet runs are in cache pops :/
[12:54:54] really?
[12:55:10] <_joe_> maybe puppet is just slow in general
[12:55:13] I guess relatively maybe. but once you run puppet on alerting hosts, everything pales in comparison
[12:55:19] <_joe_> but yes, latency to the master is terrible for puppet perf
[12:56:54] <_joe_> is there a reasonable way to understand if varnish reloaded correctly?
[12:57:09] <_joe_> or there's just looking at the quite cryptic logs of the service?
[12:57:17] just checking the journal and we do have alerts if it does not
[12:57:29] logs && alerts :)
[12:57:57] <_joe_> ok
[12:58:16] <_joe_> so i wasn't out of sync on it 😢
[12:58:23] _joe_: yeah, we got an alert for 7008
[12:58:25] 08:58:06 <+icinga-wm> PROBLEM - Confd vcl based reload on cp7008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
[12:58:31] <_joe_> yes
[12:58:35] <_joe_> not sure why
[12:58:44] <_joe_> why it worked on upload and not there...
[12:58:52] <_joe_> un maybe a race condition?
[12:59:37] <_joe_> Jul 17 12:55:40 cp7008 varnishd[1170]: CLI telnet 127.0.0.1 51255 127.0.0.1 6082 Wr 106 No VCL named vcl-3971d473-1aba-4b4c-9bd8-5633bf553d27 known.
[12:59:39] Jul 17 12:55:40 cp7008 varnishd[1170]: CLI telnet 127.0.0.1 51255 127.0.0.1 6082 Wr 106 No VCL named vcl-3971d473-1aba-4b4c-9bd8-5633bf553d27 known.
[12:59:42] <_joe_> uhh not that clear :P
[12:59:47] <_joe_> but I think I'm right
[13:00:00] <_joe_> I should revert the last patch, make puppet run everywhere, then re-merge it
[13:00:17] <_joe_> because it didn't fail on 7009
[13:01:06] <_joe_> the solution is to ask varnish to reload the config again I think, let me try
[13:01:45] that should have been done by Puppet https://puppetboard.wikimedia.org/report/cp7008.magru.wmnet/5396bbb2a19d850eef122972950924bcf43b737d
[13:02:47] <_joe_> sukhe: I think puppet does restart varnish before one of the files is available
[13:02:56] <_joe_> we're waiting for confd to restart
[13:03:21] <_joe_> sukhe: did the alert clear out now?
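On the "is there a reasonable way to understand if varnish reloaded correctly?" question: besides the journal and the Icinga reload-vcl check quoted above, varnishadm can list the compiled VCLs directly. A rough sketch follows; the `-n frontend` instance name, the unit name, and the VCL file path are assumptions about the local setup, not verified details of these hosts.

    # Hedged sketch for verifying a VCL reload on a cache host.
    # Recent reload activity and errors (e.g. "No VCL named ... known."):
    sudo journalctl -u varnish-frontend.service --since "-30 min" | grep -i vcl

    # List compiled VCLs: exactly one should be "active", and the name the
    # last reload referenced should be present in the list.
    sudo varnishadm -n frontend vcl.list

    # If the reload raced with a varnish restart, re-issuing the load/use
    # (which is what re-running the reload, or "I just reloaded varnish"
    # below, amounts to) normally recovers; the VCL path is hypothetical.
    name="reload_$(date +%s)"
    sudo varnishadm -n frontend vcl.load "$name" /etc/varnish/wikimedia_text-frontend.vcl
    sudo varnishadm -n frontend vcl.use "$name"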
[13:03:34] yep cleared
[13:03:42] all good
[13:03:44] <_joe_> I just reloaded varnish
[13:03:52] <_joe_> so yeah if we want to avoid a flood of errors
[13:03:56] <_joe_> we should do as I said
[13:04:07] <_joe_> I'll revert, run puppet ebverywhere, re-merge
[13:08:56] <_joe_> done, confirmed this way we won't have failures, proceeding
[13:09:08] ok
[13:10:49] <_joe_> I think it's an important milestone to be able to block scrapers/hotlinking on upload
[13:12:15] yeah indeed
[13:16:12] <_joe_> failed on cp3067.esams.wmnet
[13:17:24] <_joe_> not really, whatever :P
[13:17:33] didn't seem it failed
[13:20:05] <_joe_> yep
[13:20:20] <_joe_> sukhe: I'm using cumin, we know it's not a reliable piece of software
[13:20:43] _joe_: going to sit this one out :)
[13:21:12] lol
[13:21:31] depends what are you doing
[13:21:50] <_joe_> volans: just running puppet using run-puppet-agent
[13:22:01] <_joe_> it completed successfully but cumin says otherwise
[13:22:08] <_joe_> no output to prove it either
[13:22:26] <_joe_> or maybe it's there, but when you have 112 hosts that's hard to find
[13:22:33] https://www.puppet.com/docs/puppet/8/man/agent#usage-notes
[13:22:42] sorry https://www.puppet.com/docs/puppet/7/man/agent#usage-notes
[13:23:18] <_joe_> volans: run-puppet-agent, that you wrote, handles all that
[13:23:55] <_joe_> fabfur: I'm done btw, thanks for bearing with me
[13:23:57] $ git blame ./modules/profile/files/puppet/bin/run-puppet-agent | grep -c Giuseppe
[13:24:00] 72
[13:24:03] $ git blame ./modules/profile/files/puppet/bin/run-puppet-agent | grep -c Riccardo
[13:24:06] 16
[13:24:27] <_joe_> volans: we've run that script 1000s of times
[13:24:28] and no it doesn't, it returns whatever puppet agent returns
[13:24:36] so depends on the exit code of puppet
[13:24:41] <_joe_> volans: so why on that host it failed?
[13:24:49] <_joe_> while it completed successfully everywhere else?
[13:24:56] <_joe_> and the output from puppet is the same?
[13:25:08] volans: giving new meaning to git blame
[13:25:42] <_joe_> (I would also ask why cumin doesn't highlight the output of failed runs more clearly, but I've asked that when it was created to no avail)
[13:27:06] 10netops, 06Infrastructure-Foundations, 06SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9990011 (10fgiunchedi) So I looked where the probes come from, and they are part of the generic "probe mgmt network hosts for...
[13:46:55] jokes aside I don't see in the logs a clear error in this case, unfortunately for cumin we don't always log also debug like with cookbooks
[13:58:36] 06Traffic, 10conftool, 13Patch-For-Review, 10Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9990341 (10Joe) 05Open→03Resolved
[14:21:36] 06Traffic, 10GitLab (Project Migration), 13Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623#9990455 (10BCornwall) 05Stalled→03Resolved Thank you! We've achieved enough migration to consider this done, now!
[14:39:54] _joe_: BTW, TTFB on hit-front doesn't seem to be impacted so far: https://grafana.wikimedia.org/goto/WCWpOWXSg?orgId=1
[16:17:30] bblack: (or anyone else who has thoughts) follow-up idea shunting active/active traffic to failoid: what if, in the discovery services section of the wmnet zone file, I simply switch the DYNA entry to geoip!disc-failoid?
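The exit-code subtlety volans is pointing at with the puppet agent usage-notes link: with --detailed-exitcodes, a "successful" run can still exit non-zero (2 means resources changed, 4 and 6 indicate resource failures), so anything wrapping it (run-puppet-agent, cumin) has to interpret the code rather than treat any non-zero as a failed run. A small illustrative check, based on the linked documentation and not on the actual run-puppet-agent script:

    #!/bin/bash
    # Illustrative handling of puppet agent --detailed-exitcodes; this is
    # not the real run-puppet-agent logic, just the documented code meanings.
    sudo puppet agent --test --detailed-exitcodes
    rc=$?

    case "$rc" in
        0) echo "run succeeded, no changes" ;;
        2) echo "run succeeded, some resources were changed" ;;
        4) echo "run succeeded, but some resources failed" ;;
        6) echo "run succeeded, with both changes and failures" ;;
        *) echo "puppet run failed (or agent disabled / already running), rc=$rc" ;;
    esac

    # A wrapper that considers "changes are fine, failures are not" would
    # treat 0 and 2 as success and everything else as an error:
    [[ "$rc" -eq 0 || "$rc" -eq 2 ]]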
[16:17:31] (yes, I would need to add the latter to the mocks)
[16:17:56] that kind of side-steps messing with changing my a/a service to a/p, and should give me the behavior I want
[16:58:14] swfrench-wmf: not sure if that really works or not without looking at everything deeply again
[16:58:20] and heading to another meeting right now
[17:46:36] thanks, bblack: not urgent or anything :) from a quick re-read of the configs, I _think_ this would work. disc-failoid is an otherwise-normal geoip resource that we instantiate statically via puppet, so in theory I should be able to point, e.g., api-ro, at it instead of `geoip!disc-api-ro`
[20:32:08] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9992603 (10cmooney) >>! In T370164#9989347, @ayounsi wrote: > We will need to migrate the whole range to a new prefix :( Running 2 ranges is...
[21:07:26] 10netops, 06Infrastructure-Foundations, 06SRE: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366 (10cmooney) 03NEW p:05Triage→03Low
[21:08:01] 10netops, 06Infrastructure-Foundations, 06SRE: Issue with subscribing to GNMI telemetry on certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992754 (10cmooney)
[21:08:03] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9992755 (10cmooney)
[21:21:07] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992812 (10cmooney)
[22:18:59] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992974 (10cmooney)
[22:24:51] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9992984 (10cmooney)
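For the failoid idea discussed at 16:17/17:46 above, the change being floated is roughly the one sketched below. The record name, TTL, and file layout are illustrative assumptions; the real discovery zone template, mock data, and authdns deploy steps are not reproduced here.

    # Hypothetical before/after in the discovery section of the wmnet zone
    # (record name, TTL and layout shown for illustration only):
    #
    #   api-ro    300 IN DYNA geoip!disc-api-ro     ; current: normal A/A discovery
    #   api-ro    300 IN DYNA geoip!disc-failoid    ; proposed: shunt traffic to failoid
    #
    # After deploying to the authoritative servers, confirm the answer now
    # points at the failoid address from more than one resolver:
    dig +short api-ro.discovery.wmnet @ns0.wikimedia.org
    dig +short api-ro.discovery.wmnet @ns1.wikimedia.org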