[05:13:08] serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (Marostegui) p:Triage→Medium
[05:13:50] serviceops, Patch-For-Review: Redirect docker-registry URLs with tags in them to the static /tags/ HTML page - https://phabricator.wikimedia.org/T283764 (Marostegui) p:Triage→Medium
[05:14:35] serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Marostegui) p:Triage→Medium
[07:52:21] serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (JMeybohm) These times correlate with me doing docker-registry stress tests. My tests roughly ran in times: * Mi 19. Mai 12:55:54 UTC 2021 -> Mi 19. Mai 16:27:06 UTC 2021 * Do 20. Mai 07:27:26 UTC 2021 -> Do 20. M...
[08:06:49] serviceops, DBA, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui) Everything is done from either dbs and backup hosts side of things. Removing DBA tag
[08:06:56] serviceops, SRE, ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui)
[08:33:54] serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (elukey) @JMeybohm can you add more info about those tests? Are those made from codfw to the docker-registry? (trying to understand how they could fit in the picture)
[08:51:01] serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (JMeybohm) >>! In T283744#7118600, @elukey wrote: > @JMeybohm can you add more info about those tests? Are those made from codfw to the docker-registry? (trying to understand how they could fit in the picture) Su...
[08:52:44] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 (JMeybohm)
[09:24:26] serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Joe) p:Medium→High a:Joe
[09:30:07] serviceops, Patch-For-Review: httpd-fcgi image is failing validation using its own test.sh - https://phabricator.wikimedia.org/T283774 (Joe)
[09:30:55] serviceops, Patch-For-Review: httpd-fcgi image is failing validation using its own test.sh - https://phabricator.wikimedia.org/T283774 (Joe) This turned out to be more issues than I thought. There were several small issues with the test.sh script, but also a larger issue with the actual apache configurat...
[09:34:46] serviceops, Patch-For-Review: httpd-fcgi image is failing validation using its own test.sh - https://phabricator.wikimedia.org/T283774 (Joe) Open→Resolved ` # build-production-images == Step 0: scanning /srv/images/production-images/images == Will build the following images: * docker-registry.di...
[14:00:08] <_joe_> oh. my. god. https://github.com/gocql/gocql/blob/master/connectionpool.go#L257
[14:00:43] <_joe_> rzl / jayme / legoktm
[14:01:14] gorgeous
[14:01:21] breathtaking
[14:01:37] an instant of perfect, heartstopping beauty
[14:03:02] <_joe_> kudos to I guess eric who found that
[14:03:39] <_joe_> rzl: you're in a clear https://en.wikipedia.org/wiki/Stendhal_syndrome moment
[14:03:59] obama was president when that TODO was written btw
[14:04:38] ooh, TIL
[14:06:05] <_joe_> rzl: don't you know TODO's are the way programmers use to avoid being blamed for their laziness by people in the future?
[14:06:56] you and I experience TODOs very differently!
[14:07:20] my history has been that TODOs are the way programmers avoid being blamed for their laziness by code reviewers in the present
[14:07:29] <_joe_> also that yes
[14:08:09] <_joe_> I was mostly talking of my recent experience of looking back at some code of mine riddled with TODOs and I was "oh that was cheap and lazy of me"
[14:32:06] heh that's a pretty awesome find :)
[14:34:38] my personal convention is TODO marks are "oh this would be nice to have someday, maybe, think on it some more, not that critical", and I use "XXX" marks in comments for more like TODO in the sense "You really need to fix this or make a really great excuse for it before this becomes part of any real release, because it's a real problem"
[14:35:19] (and not many of those make it out of local development branches - I usually at least try to clean those up before pushing)
[14:41:57] _joe_: wow, that's nice! Maybe send a patch to remove the word robust from the first sentence of the readme :-P
[14:42:30] <_joe_> jayme: "robust"
[14:42:45] :)
[14:43:08] <_joe_> Gocql has been tested in production against many different versions of Cassandra. and it failed.
[14:43:25] ...after some time or networking issues.
[14:49:23] <_joe_> or someone moving too large containers
[14:49:40] <_joe_> you basically went all evergiven on the codfw canal yesterday :P
[14:49:49] hrhr
[15:35:33] _joe_: .... wow
[15:44:05] oh I scrolled around a bit before I closed the tab I had open on that gocql link, and there's also a TODO about a deadlock in there for extra fun
[15:44:08] https://github.com/gocql/gocql/blob/master/connectionpool.go#L365
[17:41:14] LOL that's great
[17:45:34] <_joe_> legoktm: it's a bit less fun when you remember our users' sessions depend on it :P
[17:48:53] <_joe_> but at least, we know what to do if we see that alert again - a rolling restart of the pods
[17:49:24] yeah...
[17:50:06] _joe_: thanks for fixing the httpd image, going to try it now again
[18:01:34] "Error: "main" has no deployed releases"
[18:23:27] figured out how to purge the failed release
[19:01:13] well progress
[19:01:14] 2m45s Warning Unhealthy pod/shellbox-main-8497bcdf8f-5jmmx Liveness probe failed: dial tcp 10.64.75.197:9117: connect: connection refused
[19:01:18] at least all the images are running now
[19:02:38] the apache exporter is failing the liveness checks
[19:03:40] <_joe_> that's pretty peculiar!
[19:03:57] km@cashew ~ [2]> podman run --rm docker-registry.wikimedia.org/prometheus-apache-exporter:0.0.2 -scrape_uri=http://127.0.0.1:9181/server-status
[19:03:57] flag provided but not defined: -log.format
[19:04:09] did I accidentally pull in a newer version when I rebuilt the image?
[19:05:01] ENTRYPOINT ["/usr/bin/prometheus-apache-exporter", "-log.format", "logger:stdout?json=true"]
[19:06:43] or the opposite, an older version
[19:14:13] yeah, I don't think we support log.format
[19:14:15] I'll file a bug
[19:14:34] the version in bullseye does
[19:17:15] serviceops, MW-on-K8s: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861 (Legoktm)
[19:48:10] When building Dockerfiles from blubber.yaml I always get the "USER 65533" in the resulting Dockerfile, regardless of setting the "runs: insecurely:" to True or False on the top level of my blubber.yaml. If I manually edit the USER to USER 33 in my Dockerfile, then I can run httpd just fine. but I can't just pass through the USER line from blubber
[19:49:47] mutante: did you set the uid/gid? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/libs/Shellbox/+/refs/heads/master/.pipeline/blubber.yaml#8
[19:50:06] I can have "runs: uid: 0 .. insecurely: True" in blubber and the resulting Docker file will have BOTH: USER 0 AND USER 65533 later
[19:50:37] yea, I set uid/gid and insecurely, on the top-level of my yaml
[19:50:42] looking
[19:50:52] maybe you need to set it later on? hmm
[19:51:08] tried inserting it into the variant section as well.. hmm
[19:51:50] If I skip Blubber and just put "USER 33" in my Dockerfile then things work as I want
[19:52:00] legoktm@deploy1002:/srv/deployment-charts/helmfile.d/services/shellbox$ curl https://staging.svc.eqiad.wmnet:4008/healthz
[19:52:00] File not found.
[19:52:09] progress! it's actually hitting PHP
[19:52:20] congrats
[19:52:20] x-powered-by: PHP/7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+buster1
[19:52:26] now to uh, find the missing file
[19:54:24] I'll try building from the shellbox blubber locally and compare
[19:57:39] I wonder how the health check is passing if it isn't serving the right file...
[20:05:00] serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (jijiki) Open→Resolved a:jijiki Unless @JMeybohm was not the culprit, we can mark this as resolved
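Note on the Blubber USER question at 19:48-19:51: when a Dockerfile stage contains more than one USER instruction, each one applies to the instructions after it, and the last one in the final stage becomes the container's default user. That would explain why a generated file with both USER 0 and a later USER 65533 still runs httpd as 65533. The sketch below is only an illustration of that rule, not how Blubber or Docker is implemented; the function name `effective_user` and the image name `example-base` are hypothetical, and line continuations and ARG substitution are ignored.

```python
from typing import Optional


def effective_user(dockerfile_text: str) -> Optional[str]:
    """Return the USER in effect at the end of the final build stage.

    None means no USER was set in that stage, so the base image's
    default user would apply.
    """
    user = None
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # a new stage starts from its base image's user
        elif instruction == "USER" and len(parts) == 2:
            user = parts[1].strip()  # a later USER overrides an earlier one
    return user


# Hypothetical generated Dockerfile with two USER lines, as described in the log.
sample = """\
FROM example-base:latest
USER 0
RUN true
USER 65533
"""
print(effective_user(sample))  # 65533
```

Under this reading, setting "runs: uid" only changes the result if it changes the last USER line that ends up in the generated file.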