[04:37:40] <_joe_> inflatador: you should be able to ssh via drac to open a second session
[05:40:14] inflatador: in some cases running "racadm racreset" helps, failing that you need to open a task for DC ops
[06:00:38] <_joe_> moritzm: I guessed he can't ssh into the drac from what he said
[06:28:46] <_joe_> eoghan / jelto: say I want to build a docker image in my gitlab CI pipeline
[06:29:03] <_joe_> and then publish it to our registry
[06:29:35] <_joe_> and no, this is not something we can do with the deployment pipeline, it's what lives before that.
[06:30:57] <_joe_> is that possible? can I use docker or is there a specific engine I should target?
[06:43:40] _joe_: currently we only do it with the blubber and kokkuri abstraction (https://gitlab.wikimedia.org/repos/releng/kokkuri). afaik we don't build raw docker images
[06:44:19] <_joe_> jelto: ok, kokkuri uses what container engine?
[06:44:32] buildkit
[06:44:34] <_joe_> sigh
[06:44:49] <_joe_> so no podman or docker in CI?
[06:45:22] <_joe_> how do we even use our base images with buildkit? well I'll have to learn a bit more about buildkit I guess
[06:47:18] The only pipelines I'm aware of for image building are blubber + buildkit
[06:47:49] <_joe_> jelto: yeah that's not what I meant, don't worry, you gave me the information I wanted
[06:49:11] but I see that not all use-cases are covered by buildkit and blubber. We might have to look into building "other" images as well
[06:50:29] <_joe_> jelto: I'm specifically looking at docker-pkg built images
[06:50:59] <_joe_> I want to move things like production-images to gitlab and use a pipeline to build and deploy the images
[07:03:08] ok, normal Dockerfiles are used there afaik. This does not work with kokkuri out of the box I think. You can use buildkit as your build engine with normal Dockerfiles (I do it locally). But we have to test and integrate it into the pipeline
[07:05:15] <_joe_> jelto: yeah that was my idea basically :)
[11:00:50] go effie
[11:00:53] lol
[11:01:00] now with /go
[11:02:34] :P
[11:02:36] :)
[12:09:47] herron: let me know when you are around
[12:10:06] we can have one last go before picking this up next week
[13:54:10] herron: ping :)
[14:03:10] effie: hey
[14:03:25] :)
[14:03:37] herron: shall we have another go ?
[14:03:54] we have turned debug on on codfw, hoping to get more data
[14:04:05] sure, nice. I was going to ask about that too
[14:05:14] I had a look at the tegola ca-certificates yesterday afternoon and I see the same thing with the ca cert present there afaict, but I also wonder if it's worth installing (maybe live hacking on one instance?) the latest version just to rule that out for good?
[14:05:29] we have done that already
[14:06:00] the container has the "latest" wmf-certificates
[14:06:37] curl works from the container, so whatever it is, we assume it is in the Go application
[14:07:07] gotcha, yeah I was going off the date in the version 2021-11-18, so that already contains the latest, or it's been updated since?
[14:07:44] but at any rate yes, let's give it another go, I'll get the deploy to codfw again
[14:09:05] cheers, let me know when puppet has run on all -fe hosts
[14:09:16] I will get a tcpdump there too
[14:10:45] will do
[14:11:08] when you say all -fe that's all codfw yea?
[14:16:16] yes
[14:16:31] effie: puppet runs in codfw just finished
[14:16:38] great, thank you
[14:19:13] <_joe_> keep in mind it will take some time for envoy to restart
[14:20:33] <_joe_> effie: is it me or is it working?
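A minimal sketch of the kind of check discussed above (curl succeeding from the container while the Go application is suspected), assuming a Go toolchain is available; the hostname is a placeholder, not the real swift endpoint:

```go
// Sketch: confirm that a Go TLS client sees the same system trust store
// that curl does, and that a handshake against a given host verifies.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
)

func main() {
	// Go reads /etc/ssl/certs on Debian-based images, the same place
	// the wmf-certificates package drops its CA.
	pool, err := x509.SystemCertPool()
	if err != nil {
		log.Fatalf("loading system cert pool: %v", err)
	}

	// Placeholder address; swap in the service actually being debugged.
	conn, err := tls.Dial("tcp", "thanos-swift.example.internal:443", &tls.Config{RootCAs: pool})
	if err != nil {
		log.Fatalf("TLS handshake failed: %v", err)
	}
	defer conn.Close()

	peer := conn.ConnectionState().PeerCertificates[0]
	fmt.Printf("handshake ok, server cert issued by %q\n", peer.Issuer.CommonName)
}
```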
[14:20:43] <_joe_> I'm looking at the logs from a tegola pod in codfw
[14:21:01] there is no traffic yet, but I do see the self test not complaining
[14:21:27] I have not restarted any pods yet btw
[14:21:50] <_joe_> don't
[14:22:22] <_joe_> GET /tegola-swift-codfw-v002/shared-cache/osm/6/23/24 is successful, right now
[14:22:41] I know
[14:22:48] we are seeing the same logs :p
[14:22:49] <_joe_> just send traffic to it, we'll see if I am wrong
[14:22:56] <_joe_> I'm pretty sure we're ok
[14:22:57] pre-pod-restart/
[14:23:00] I was about to
[14:23:17] nothing has changed apart from adding debug on the s3 client btw
[14:23:18] <_joe_> I am mostly worried by the amount of logs we'll generate
[14:23:31] it will be quick and painful
[14:24:23] <_joe_> effie: have you already repooled tegola in codfw?
[14:24:33] just did
[14:24:42] <_joe_> because right now I fear the issue is different, we'll see quickly enough
[14:25:10] <_joe_> specifically, I think this problem is related to https://phabricator.wikimedia.org/T300119
[14:26:18] <_joe_> err arnoldokoth jhathaway we might be naughty and make maps pag.e
[14:29:18] <_joe_> effie: everything is working ok now
[14:29:24] no
[14:29:29] I am depooling
[14:30:03] the other end stops responding
[14:30:17] I see many gets but no responses
[14:30:45] <_joe_> effie: you're not looking correctly unless I'm missing something
[14:30:58] I am looking at the debug log of the s3 client
[14:31:11] while it was actually working, I could see responses
[14:31:34] and then, I started seeing the client sending gets, but no replies
[14:32:00] and then I saw many inflight requests on tegola, which is what has been happening all along
[14:33:11] <_joe_> effie: ok something's not right *in thanos*
[14:33:17] seeing the TLS handshake timeouts again as well
[14:33:37] that lines up with the tcpdumps
[14:33:45] <_joe_> this is thanos in eqiad
[14:33:48] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=thanos&var-origin_instance=All&var-destination=local_port_8888&viewPanel=6
[14:34:02] <_joe_> I would say there's a bigger problem than tegola
[14:35:12] <_joe_> happened every time we moved codfw to the new certs
[14:35:15] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=thanos&var-origin_instance=All&var-destination=local_port_8888&viewPanel=6&from=now-2d&to=now
[14:36:18] <_joe_> effie: please depool, but this was useful
[14:36:37] the only other thing I have seen on the -fe hosts is the spike in attempt-fail TCP errors, which in the past has been apps trying to talk to ipv6 ips where the port is closed, but I guess something will be in the dump
[14:36:45] _joe_: I have already done so
[14:37:42] <_joe_> herron: please don't revert for now
[14:37:50] also seems to happen before cfssl certs but not sure what all of these correlate to https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=thanos&var-origin_instance=All&var-destination=local_port_8888&viewPanel=6&from=now-30d&to=now
[14:38:13] _joe_: sure
[14:38:17] for any party interested, there are tcpdumps in jiji@kubernetes2010:~/pre-pod-restart and thanos-fe2004:/home/jiji/pre-pod-restart
[14:39:19] the pod IP in kubernetes2010 is 10.194.150.55 and thanos's IP is 10.2.1.54
[14:39:36] <_joe_> herron: can you confirm the last two surges corresponded with the switch to the new certs?
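A rough illustration of the "debug on the s3 client" idea above, not tegola's actual code: a wrapping HTTP transport that logs each GET and whether a response (or a handshake timeout) ever came back. The host in the URL is a placeholder; only the object path is taken from the log.

```go
// Sketch: log request/response timing around an S3-style GET, with a short
// TLS handshake timeout so handshake stalls surface quickly instead of hanging.
package main

import (
	"log"
	"net/http"
	"time"
)

type loggingTransport struct {
	base http.RoundTripper
}

func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.base.RoundTrip(req)
	if err != nil {
		log.Printf("%s %s -> error after %s: %v", req.Method, req.URL.Path, time.Since(start), err)
		return nil, err
	}
	log.Printf("%s %s -> %d in %s", req.Method, req.URL.Path, resp.StatusCode, time.Since(start))
	return resp, nil
}

func main() {
	client := &http.Client{
		Transport: &loggingTransport{
			base: &http.Transport{TLSHandshakeTimeout: 10 * time.Second},
		},
	}

	// Placeholder host; the object path mirrors the GET seen in the pod logs.
	resp, err := client.Get("https://thanos-swift.example.internal/tegola-swift-codfw-v002/shared-cache/osm/6/23/24")
	if err != nil {
		log.Printf("request failed: %v", err)
		return
	}
	resp.Body.Close()
}
```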
[14:40:10] <_joe_> but yeah that looks ugly
[14:44:02] _joe_: negative, seems to begin at the top of the hour 12:00 on the 17th, while the patch was merged at 13:23 on the 17th with deployment a few minutes after that
[14:44:20] <_joe_> herron: perfect
[14:44:33] <_joe_> it means it's thankfully unrelated
[14:44:53] 😅
[14:45:18] and fwiw the recent spike has settled by now
[14:49:50] <_joe_> herron: where did you see the TLS handshake timeouts?
[14:50:50] _joe_: seeing that here https://logstash.wikimedia.org/goto/4d16beee42e401ec1fa5084968add315
[15:47:10] herron: please roll back, we will try again on Tues, I am off on Monday
[15:47:21] I will try to make time to update tasks related to the state of things
[15:47:27] effie: ok, will do
[15:47:32] find anything interesting today?
[15:47:35] thanks
[15:47:59] just more confusion sadly
[15:48:07] ahh good times
[15:49:14] I'll revert shortly, just need to run my kid to afternoon care
[15:49:23] sure, thank you
[17:39:20] arnoldokoth: jhathaway: topranks and I are going to reboot the lvses in esams
[17:39:38] site is depooled so nothing to worry about, but if you see a page, just ACK it and let us know here
[17:39:41] thank you
[17:39:47] (I am downtiming)
[17:40:13] ugh and I messed something up here, give me a moment
[17:47:59] cr1-esams just paged.
[17:48:09] thanks
[17:48:12] toprank.s is on it
[17:48:20] just ACK it for now
[18:09:59] arnoldokoth: all done on the lvs'es
[18:10:04] no more alerts expected
[18:21:23] sukhe: Cool. :)
[18:52:24] 21:21:20 legoktm: Do you have any practical advice for how to test shellbox containers in staging? Asking in reference to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/949548/ <-- live hack ProductionServices.php to point to the staging service instead of the prod one. (I'm guessing that won't work with mw-on-k8s...)
[19:48:01] anyone know where I can find the debian packaging in git for wmf-certificates?
[19:51:41] https://gerrit.wikimedia.org/r/q/project:operations/debs/wmf-certificates ?
[19:52:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-certificates/+/refs/heads/main/debian/
[20:09:38] thanks dancy!
[21:39:32] legoktm: that seems un-ideal as a deployment validation process. What I did was mostly just a process that verified that a container was running, which is not really better. I guess someone (maybe me the next time I update syntaxhighlight) should take the time to write a little client that can validate a deployment better.
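A minimal sketch of the "little client" floated at the end: a checker that hits a deployment and exits non-zero if it does not answer cleanly. The URL and the /healthz path are assumptions for illustration, not Shellbox's documented interface.

```go
// Sketch: fail fast if a freshly deployed service does not respond with 200
// within a timeout, so a pipeline step can gate on it.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hypothetical staging endpoint; pass the real one as the first argument.
	target := "https://shellbox.svc.example.internal/healthz"
	if len(os.Args) > 1 {
		target = os.Args[1]
	}

	client := &http.Client{Timeout: 15 * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		fmt.Fprintf(os.Stderr, "validation failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "validation failed: got HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("deployment responded with HTTP 200")
}
```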