[09:19:34] FYI, doing some pod rolling restarts in eqiad trying to reproduce https://phabricator.wikimedia.org/T366094
[09:19:56] cc marostegui ^
[09:20:38] ok
[09:20:40] thanks for the heads up
[10:37:20] vgutierrez: I restarted pybal on lvs1019 for ^
[10:37:40] I also set parse1002 to inactive. It was spamming pybal logs enough
[10:42:59] not vg but ok :)
[10:56:52] we just got a p4ge.... https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&refresh=1m
[10:57:03] can this be related to any of the issues from k8s akosiaris effie ?
[10:57:08] almost certainly
[10:57:24] ah, that alert?
[10:57:44] yeah we need to remove that alert I think eventually. We are alerting on the wrong thing. It was a nice signal in the past
[10:58:10] It looks like it is recovering though
[10:58:51] I did notice slowness fwiw, may have been my mobile data though cause it's a bit rubbish
[11:04:36] RhinosF1: do you notice slowness right now?
[11:05:46] we did have increased latencies for a bit while debugging something
[11:08:04] akosiaris: nope, not anymore
[11:08:33] ok, you might notice it once more soon. It's the last test in my battery of tests
[11:10:28] running my test now, it has the highest chance of all my tests of causing some issues
[11:10:33] they will be transient however
[11:15:05] akosiaris: it did again yep
[11:19:32] mw-api-ext is the only thing still running btw
[11:20:49] can someone take a look at gerrit please?
[11:21:04] I am cleaning up my tests, I wanna leave things ok for mw deployers
[11:21:18] akosiaris: what is up with gerrit? it seems to be working for me
[11:21:46] jelto is on it. It loaded for me fine too just now, but pretty slowly
[11:21:53] ok
[11:23:04] !log T366094 re-undeploy otel-collector, it being around increased traffic to the API >50%
[11:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:09] T366094: k8s master capacity issues - https://phabricator.wikimedia.org/T366094
[11:23:47] the gerrit machine is seeing quite some load (traffic wise and cpu)
[11:27:12] I am re-enabling puppet and running it on lvs1019
[12:14:33] Say... how bad is it when an install/reimage ends the installer phase with a kernel panic? Asking for uh, a friend.
[12:14:42] akosiaris: I'll be getting online shortly, can you tell me more about your tests?
[12:14:54] klausman: right before the reboot? that's normal
[12:15:01] Oh, phew.
[12:34:57] hello on-callers! I am going to complete the rollout of PKI TLS certs for thanos-swift
[12:35:06] thanks elukey
[12:35:21] in the past it affected Tegola (maps) but we already have two nodes with the new certs and nothing horrible was registered
[12:59:38] thanks elukey
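For the record, the parse1002 depool and pybal restart mentioned at 10:37 would typically look roughly like this (a sketch; the exact commands used aren't in the log, and the conftool selector is an assumption):

```
# on a conftool/cumin host: mark parse1002 inactive so pybal stops probing it
sudo confctl select 'name=parse1002.eqiad.wmnet' set/pooled=inactive
# on the load balancer itself: restart pybal so it starts from a clean state
sudo systemctl restart pybal.service   # on lvs1019
```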
[13:06:07] cdanis: I am gonna wrap them up and document them in the task
[13:06:21] ok
[13:06:29] otelcol was part of the trigger, but only part of it
[13:06:32] to be clear
[13:12:30] akosiaris: is our tl;dr that, we will just have a lot of traffic/load during deploys, since we have more hosts more services etc
[13:12:54] so we scale horizontally if needed?
[13:19:54] cdanis: it was part of the issue, but it was a very large part of the issue.
[13:20:01] yes
[13:20:19] I think it was responsible for something like 50%-66% of the network traffic from the original masters
[13:20:38] effie: no recommendations yet, aside from figuring out a bit what exactly the otel collector is doing and definitely finishing up the wikikube-ctrl work
[13:20:45] but don't kill the old masters just yet
[13:21:11] I don't think we are in a rush to kill the old masters anyway
[13:23:52] akosiaris: I have a decent handle on what it is doing, fwiw, and I'm also looking at the source rn https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/k8sattributesprocessor/internal/kube/client.go
[13:33:40] cdanis: am I correct to think that running it as a daemonset means we get n times the load for the same functionality as a single replica? That's how I understood the docs, but I only looked briefly
[13:34:00] kamila_: no -- when running as a daemonset, each instance only queries the API for pods on its local node
[13:34:49] that's basically how it's designed to be deployed on k8s
[13:34:57] Ah, OK, so it does shard, thank you
[13:35:02] I think we've just never had to think about horizontal scalability of the apiserver before
[13:35:20] Yeah, makes sense
[13:38:05] we could potentially disable some of the attributes it adds, which could reduce the number of API calls it makes
[13:38:21] but otoh it's running in exactly the same configuration on codfw just fine
[13:43:07] example: 2024-05-29T11:07:15.126Z info kube/client.go:113 k8s filtering {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "labelSelector": "", "fieldSelector": "spec.nodeName=kubernetes1057.eqiad.wmnet"}
[13:43:38] and that gets applied to all the API calls (LIST / WATCH) that each collector makes
[13:46:52] hm
[13:47:03] I possibly have spotted a way to reduce the load further
[13:47:31] see `"pipeline": "metrics"` up above? we don't need that, nor one for logs
[13:47:38] (at least not at present time 😇)
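To make the sharding concrete: with that field selector, each DaemonSet instance's LIST/WATCH against the apiserver is roughly equivalent to the first command below, rather than the much heavier cluster-wide one a single replica would need (a sketch; the node name is just the one from the log line above):

```
# per-node pod watch, as each k8sattributes instance effectively does
kubectl get pods --all-namespaces --watch \
  --field-selector spec.nodeName=kubernetes1057.eqiad.wmnet

# versus the cluster-wide equivalent
kubectl get pods --all-namespaces --watch
```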
[14:06:09] akosiaris: thanks for the writeup. I have a maybe-counterintuitive hunch that, by removing CPU from the cluster, you're making the network saturation situation better
[14:09:00] Sorry, stupid questions again. I'm trying to set up a new discovery record (for apus, a new Ceph-backed S3 service), so I've got hostnames &c to use in actually setting it up. I started at https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production and step one is essentially "fill out an entire Wmflib::service entry in hieradata/common/service.yaml", which needs LVS IPs. How do I go about getting those allocated?
[14:09:23] The example in the docs ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/545225 ) predates much of this, so isn't very helpful
[14:09:23] cdanis: I have the EXACT same thinking
[14:09:47] I am artificially limiting the TX from kubemaster1002 because its CPU is pegged at 100%
[14:09:54] yep!
[14:10:00] and this is supported by that 272MB/s we saw yesterday
[14:10:15] and it needs that CPU to handle the mutating operations that will cause a bunch of network TX to all the low-cpu-but-high-bw WATCH clients
[14:10:25] and so you add more CPU, you increase the rate of mutating operations, which, etc
[14:11:05] and re: balancing them, what you can do is, you kill them and make the client reconnect after a while ;)
[14:11:14] didn't we do the same thing for ATS connections to its backends? albeit on its side
[14:11:30] yet, which isn't very easy to do on k8s side without an apiserver restart
[14:11:34] yes*
[14:11:54] yeah that's why I was suggesting raising it with upstream
[14:12:54] Ah, I think I need https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service (though there is a bit of chicken-and-egg here)
[14:14:04] Emperor: you don't need LVS IPs to add a new service in discovery
[14:14:10] look at helm-charts for an example
[14:14:22] it's an LVSless discovery enabled service
[14:14:25] there are a few more
[14:14:32] Emperor: you just need the service IPs, not the LVS ones; you need the class for that
[14:15:58] Emperor: happy to help in -traffic. feel free to ping us there if required
[14:19:53] akosiaris: btw -- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1037083/1
[14:20:22] you are right, I hadn't been factoring in that all five control plane nodes are still active in codfw
[14:20:34] if what you say on the subject is true, hurray!
[14:20:45] +1ed
[14:20:46] maybe. it is a hunch based on a cursory reading of the logs + code
[14:20:56] if you don't mind I'll try it on codfw
[14:21:00] go ahead
[14:21:03] oh, yes, the documentation and type declaration in modules/wmflib/types/service.pp don't agree as to whether the "lvs" parameter is optional or not (and confused me that the 'ip' parameter had to contain LVS IPs)
[14:21:27] Emperor: when in doubt, trust existing instantiations over the comments ;)
[14:21:51] I hear you, but I'm a bit out of my depth here and trying to RTFM rather than cloning and hacking
[14:22:25] I will take sukhe up on their offer of assistance in -traffic, though.
[14:45:58] akosiaris: before and after: https://grafana.wikimedia.org/goto/BjzDn7yIg?orgId=1
[14:47:51] akosiaris: if it sounds good to you, I'll sum up kube-apiserver network load seen in codfw with this configuration during the MW train deploy, and if it's under like idk 1.5Gbit/s? then we can re-deploy otelcol in eqiad as well
[14:48:23] btw, something *not* mentioned on the capacity crunch bug so far: apparently we have made scap deploys significantly (like 50%!) faster
[15:17:57] cdanis: SGTM
[16:20:41] jhathaway: is puppet merge taking long?
[16:21:28] merged
[16:21:34] thanks
[17:43:21] the lvs1019 thing, bunch of mwNNNN failing health probes from lvs?
[17:43:31] sorry, let's talk here
[17:43:35] not in #-operations
[17:43:43] for mwdebug
[17:44:10] yeah, the lvs hosts can't connect to port 4444 on I think probably most/all of the kube machines
[17:44:22] mediawiki-pinkunicorn-tls-service NodePort 10.64.72.109 4444:4444/TCP 449d
[17:44:32] the Kubernetes resource for the NodePort does exist in eqiad
[17:45:02] and I've checked by hand on one random k8s host (kubernetes1012) and I don't see anything in `iptables -nL` output about dport 4444
[17:45:19] is port 4444 new?
[17:45:19] whereas I do see this, for instance:
[17:45:22] no
[17:45:24] MARK tcp -- 0.0.0.0/0 0.0.0.0/0 /* cali:4qSdb5zElt3Q5jJs */ /* Policy mw-api-int/knp.default.mediawiki-main ingress */ multiport dports 4446,8080,9117:9118,9151,9181,9361 MARK or 0x10000
[17:45:31] in previous incident 2024-04-03 calico/typha down
[17:45:41] 4444 is the mw-debug port: https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#Server_groups
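For anyone replaying this later, a quick way to check both halves of the port-4444 path on a worker (a sketch; the plain `iptables -nL` above only lists the filter table, while the NodePort DNAT rules live in the nat table):

```
# look for the NodePort DNAT rule in the nat table, not the default filter table
sudo iptables -t nat -nL | grep -w 4444

# probe reachability from the LVS side; -v shows whether the connection
# is actively refused or just times out (a SYN going nowhere)
nc -vz -w 3 kubernetes1012.eqiad.wmnet 4444
```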
[17:46:04] it has been said back then "calico controller being down would only be an issue if there was also a deployment" and this was the case again?
[17:46:08] I don't see anything about mw-debug or mwdebug
[17:46:34] (which is the good news, this is only affecting mw-debug traffic atm, so it's blocking scap but no direct user impact)
[17:47:12] well, so long as it's not impacting unrelated traffic through lvs1019
[17:47:25] fair
[17:48:09] okay, apparently those kinds of iptables objects not existing everywhere is normal
[17:48:37] and they do appear on exactly 2 `A:wikikube-worker-eqiad` hosts, so, that checks out
[17:49:14] that'll be the hosts with the two mw-debug pods that are actually running, yeah
[17:49:25] doesn't seem like it should cause anything weird though
[17:50:23] I really should know the answer to this question already, but: are more than those two hosts *supposed* to be answering health checks on 4444?
[17:50:48] cdanis: -t nat and there should be rules on all nodes DNATing that port
[17:50:57] yeah from the basic end of the lvs config, lvs expects 4444 to work on the whole kubesvc fleet
[17:51:14] I am putting kids to sleep, so can't respond, just comment on IRC
[17:51:22] thanks akosiaris
[17:51:28] well I guess there's another routing layer too
[17:51:41] oh that DNAT rule will be the answer to my followup question then
[17:51:51] ("how did that ever work")
[17:52:34] And yes all nodes should answer to hits on 4444, but they aren't really answering themselves but forwarding traffic to the pods
[17:52:46] nod
[17:53:01] okay, those rules are on 100% of nodes
[17:53:05] and they are identical across nodes as well
[17:53:23] Look if those pods are so saturated or something that they are croaking and aren't able to respond
[17:53:29] so: did the forwarding stop working, or did the pods stop answering -- second one sounds more likely
[17:53:40] In that case they may be taken out of rotation
[17:53:55] In which case connection refused will be returned
[17:53:57] they look healthy: https://grafana.wikimedia.org/goto/aFxk0nySg?orgId=1
[17:54:15] workers mostly idle, low resource usage etc
[17:55:19] akosiaris: there's also no events in the kubectl get events, and connections are hanging on connect, rather than being refused
[17:55:34] the SYN packet goes nowhere, is what it looks like
[17:56:52] Hmmm, weird
[18:01:29] did parse1002 decide to wake up from the weird slumber it was in?
[18:01:54] talking about https://phabricator.wikimedia.org/T363086
[18:01:56] temp and then died again. Dell is going to send a new mainboard now
[18:02:49] well, it's up right now
[18:02:50] well, it's probably somewhat fixed because it's sending heartbeats
[18:02:57] but not entirely ok?
[18:03:04] and it does host 1 pinkunicorn pod
[18:03:34] oh, did we not re-cordon it?
[18:03:53] I just cordoned it again
[18:03:59] but it was in state READY
[18:04:28] there goes the pod
[18:04:33] forcefully deleting the 1 pod it had
[18:04:45] !log kubectl -n mw-debug delete pods mw-debug.eqiad.pinkunicorn-6d4d68cd79-nq695
[18:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:02] annnnd that was it
[18:05:04] and pybal recovered
[18:05:13] thanks akosiaris <3
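What the cleanup above amounts to as commands (a sketch; the log shows a cordon plus manual per-pod deletes, with `kubectl drain` being the usual one-step equivalent):

```
# re-cordon the half-dead node and see what is still scheduled on it
kubectl cordon parse1002.eqiad.wmnet
kubectl get pods -A -o wide --field-selector spec.nodeName=parse1002.eqiad.wmnet

# evict everything except DaemonSets in one go, instead of deleting pods by hand
kubectl drain parse1002.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data
```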
[18:05:57] ok, in the incident review write a line "where we got lucky? Alex did a kubectl get pods -o wide just out of habit, and saw by chance parse1002 and put 1+2 together, said it equals to 12 and whatever from there on"
[18:06:35] and if my kids ever see the above in bash or whatever, just know that I was joking.
[18:06:39] ahahaha
[18:07:16] * akosiaris off again
[18:07:30] other pods on parse1002, apart from daemonsets: two linkrecommendation and one toolhub, both in status error
[18:07:34] I'll delete em both as well, I guess
[18:08:03] is there some deeper learning here about half-dead servers taking out a bunch of stuff?
[18:08:45] yeah, it might actually be worth writing this one up and discussing it
[18:09:00] something about silver and garlic ? or whatever it is that repels zombies?
[18:09:10] ok, enough jokes, I am off
[18:09:13] the easy process answer is "make sure we cordon a machine if it's haunted" but it also shouldn't have hurt so bad when we hit ourselves
[18:09:46] thanks again akosiaris, have a good night
[18:11:41] rzl: this is the *second* Interesting K8s Thing worth discussing in as many days
[18:11:44] thanks akosiaris
[18:12:10] yeah
[18:13:23] now we're just left with `PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled` from this morning
[18:13:32] which is probably worth us cleaning up, rzl
[18:14:05] oh, ye
[18:14:14] one sec while I put these three pods out of their misery, then refill my coffee
[18:15:37] hm it's running
[18:16:40] !log evacuate cordoned node parse1002: kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-7gsqs; kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-xl7t4; kubectl -n toolhub delete pod toolhub-main-crawler-28616760-jrhbb # T363086
[18:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:45] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086
[18:25:26] okay, back
[18:26:28] I am not sure what is wrong
[18:30:25] the pybal management port? (9090) which is being queried by the nrpe is indeed still reporting the critical
[18:35:40] meanwhile we also have this
[18:35:43] 💔cdanis@lvs1020.eqiad.wmnet ~ 🕝☕ sudo ipvsadm -l -t 10.2.2.8:6443 --stats
[18:35:45] Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
[18:35:47] -> RemoteAddress:Port
[18:35:49] TCP kubemaster.svc.eqiad.wmnet:6 37962 327308 0 67832843 0
[18:35:51] -> kubemaster1001.eqiad.wmnet:6 22150 171479 0 34405585 0
[18:35:53] -> wikikube-ctrl1002.eqiad.wmne 3185 31781 0 7000897 0
[18:35:55] -> kubemaster1002.eqiad.wmnet:6 3184 28055 0 5917272 0
[18:35:57] -> wikikube-ctrl1001.eqiad.wmne 3184 32038 0 6962523 0
[18:35:59] so like ... it's working and being used
[18:37:05] if nothing else seems obvious, it's possible that pybal has built some bad internal state
[18:37:38] we know there's some logic flaws in there, which can be exercised when there's lots of depool/repool happening while lots of health probes failing and trying to go below the depool_threshold.
[18:37:48] sometimes it can just end up with bad internal state for some backends and will have to be restarted
[18:37:59] it's rare, but it has happened before
[18:38:06] got it
[18:38:09] bblack: https://phabricator.wikimedia.org/P63596
[18:38:10] I think that this happened here
[18:38:52] maybe restart lvs1020 first and wait a few, to make sure that one's clean + sane before 1019 blips real traffic
[18:39:01] the error isn't happening on lvs1019 :)
[18:39:03] just on lvs1020
[18:39:05] oh ok
[18:39:06] https://sal.toolforge.org/log/lioyxY8BxE1_1c7sKbsl
[18:39:08] just restart it then :)
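For reference, checking pybal's own view of the pool before restarting it (a sketch; the instrumentation URL path on port 9090 is an assumption on my part, the restart itself is what was done):

```
# ask pybal's instrumentation interface (port 9090, the one NRPE queries)
# what it currently believes about the kubemaster_6443 backends
curl -s http://localhost:9090/pools/kubemaster_6443

# if the state is wedged ("marked down but pooled"), restart pybal
sudo systemctl restart pybal.service   # on lvs1020
```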
[18:39:09] Hi folks. Different httpbb test failure during train rollout:
[18:39:09] ```
[18:39:09] 18:38:18 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet...
[18:39:09] https://donate.wikimedia.org/w/index.php?title=Special:FundraiserRedirector&reloaded=true (/srv/deployment/httpbb-tests/appserver/test_foundation.yaml:28)
[18:39:09] Location header: expected 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous', got 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'.
[18:39:10] ```
[18:39:10] lvs1019 was restarted
[18:39:48] `FAIL: 131 requests sent to mwdebug.discovery.wmnet. 1 request with failed assertions.`
[18:39:59] dancy: so the params changed from `utm_` to `wmf_`, was that intended?
[18:40:13] if not the tests caught a bug, if so the tests weren't updated :)
[18:41:04] really missing the way gerrit colors diffs rn
[18:41:59] Hmm.. it would be helpful if the "expected" and "got" values were aligned in the error output.
[18:42:04] Like so:
[18:42:14] ```
[18:42:14] Location header:
[18:42:14] expected 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous',
[18:42:14] got 'https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'.
[18:42:14] ```
[18:42:48] patches welcome :D
[18:42:56] 👍🏾
[18:43:20] (or assign me a task if you'd rather, I just can't promise it'll happen right away)
[18:43:38] Seems like a simple enough one to take a stab at myself.
[18:43:45] sweet, happy to review
[18:44:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/httpbb/+/refs/heads/master/httpbb/main.py#185
[18:44:53] rzl: do you know if any clients are depending upon one output line == one error right now?
[18:45:17] cdanis: seems like the restart fixed it?
[18:45:21] sukhe: indeed
[18:45:46] cdanis: I don't know of anything that parses the output at all
[18:45:51] cdanis: ok!
[18:45:55] it's all human-facing afaik
[18:46:05] (except the exit status)
[18:46:06] sgtm :)
[18:48:56] rzl: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037149
[18:49:18] dancy: utm_campaign too, yeah?
[18:49:30] oh and utm_source, it's all three
[18:49:50] ah yes. thanks.. fixing
[18:56:21] rzl: Fixed. I was way off on the first attempt.
[18:57:06] dancy: lgtm, as soon as jenkins is happy I'll +2 and merge it
[18:57:13] thx
[18:57:48] normally we'd do a dance around making sure the tests are passing so the alert doesn't pop -- in this case we'll just break the test since you're about to deploy the thing that fixes it, and if it yells in the meantime we'll ignore it
[18:58:12] but if you end up rolling back the train, we'll revert this too and then be more thoughtful :)
[18:59:00] nod.. I was considering that possibility. :-/
[19:00:36] (sorry, I mean, not that you were thoughtless, this is exactly what I'd do)
[19:01:07] Nod.. understood.
[19:08:22] dancy: merged and ran puppet on deploy1002 -- httpbb now passes on mwdebug and fails on mw-web, as expected and the opposite of before, you should be good
[19:09:05] Thanks!! Proceeding.
[19:22:50] rzl: dancy: I got mildly sniped https://i.imgur.com/M8BBux8.png
[19:23:24] Nice.. I was just thinking about adding in color diff.
[19:23:33] I'll let you do that after my changes.
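Re the aligned expected/got output discussed from 18:41 onward: even without colour, splitting the two URLs on `&` makes the changed parameters jump out (a runnable sketch using the exact values from the failure above):

```
expected='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous'
got='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'

# one query parameter per line, then a plain diff of the two lists
diff <(tr '&' '\n' <<<"$expected") <(tr '&' '\n' <<<"$got")
```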
[19:23:51] whoa, neat
[19:24:01] I'm unsure about the spacing thing
[19:24:05] gonna drag this tool kicking and screaming into the late 20th century
[19:24:21] it is potentially confusing every way I've thought of to do it
[19:24:26] so I might take that bit out
[19:25:02] I think I would drop it in part for copypastability reasons
[19:25:08] yeah
[19:25:12] it's a cool idea though, I'm not 100% against it
[19:25:39] https://i.imgur.com/10ogGAC.png
[19:29:17] hmmmm what if the unit was words instead of characters
[19:29:31] it'd be nice to highlight `wmf` or even `wmf_source` rather than just `w` and `f`
[19:29:41] maybe I'm overfitting on this specific example though
[19:30:00] do we have historical logs of httpbb test failures rzl
[19:30:44] not centralized ones afaik -- you could try the systemd timer logs on the cumin hosts
[19:31:12] I think they mostly won't be semantically-interesting failures like this one though, mostly connectivity hiccups
[19:31:22] yeah
[19:51:46] rzl: FYI httpbb tests don't work as-is right now: https://integration.wikimedia.org/ci/job/tox/657/console
[19:52:59] that's odd, I guess that error message changed with a version update or something
[19:53:06] I'll fix that today, thanks for the heads up
[19:53:54] rzl: if it's good with you I'd like to do a rolling restart of a bunch of mw pods in codfw once the admin_ng/otelcol change is deployed and stable
[19:54:50] sgtm, maybe we wait until after the backport window coming up
[19:55:30] (or just use the backport as a rolling restart, depending)
[19:55:36] perrrrfect
[20:16:55] rzl: well, both patches I did today to otelcol config in codfw certainly did something: https://grafana.wikimedia.org/goto/y1f487sSR?orgId=1
[20:17:59] mmmm yes I think I can almost see the change there if I squint
[20:45:52] 20:43:42 <+jinxer-wm> FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:46:07] this happened during a deployment but I think it is a leftover of h.nowlan's work earlier https://sal.toolforge.org/production?p=0&q=kubernetes2032&d=
[20:46:40] yeah was just about to point you to that rename
[20:47:03] I put in a silence
[20:47:13] not sure offhand what the lingering artifact is but I agree there must be one somewhere
[20:47:38] at some point I'm going to have to block off a week, decline all my meetings, quit IRC, and just properly learn how calico works
[20:47:52] I learned a little of that today
[20:47:55] ... a little
[21:30:16] denisse: per this task, https://phabricator.wikimedia.org/T274377, do you know if the sre-email-bot account is in use?
[22:27:37] jhathaway: forwarded you an email about sre-email-bot
[23:00:18] cwhite: thanks!
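And re the word-level highlighting idea from 19:29 above: git's word-diff with a custom token regex gets most of the way there, highlighting whole parameters like `utm_source` vs `wmf_source` instead of single characters (a self-contained sketch; the output colouring depends on your terminal):

```
expected='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&utm_medium=spontaneous&utm_source=fr-redir&utm_campaign=spontaneous'
got='https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=XX&wmf_medium=spontaneous&wmf_source=fr-redir&wmf_campaign=spontaneous'
printf '%s\n' "$expected" > /tmp/expected.url
printf '%s\n' "$got" > /tmp/got.url

# treat anything between URL delimiters (& ? =) as a single "word"
git diff --no-index --word-diff=color --word-diff-regex='[^&?=]+' \
    /tmp/expected.url /tmp/got.url
```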