[06:33:14] <_joe_> ottomata: uhm the build images aren't supported since some time, I was sure we removed every reference
[06:33:31] <_joe_> as for build cache, there's a command line option to build
[08:58:40] Any experts in building go for Debian kicking around? I have a library package that doesn't build (errors suggesting some sort of type incompatibility) and am a bit confused; I can obviously ask the debian go folks to point out my Obvious Error if not...
[09:02:02] <_joe_> Emperor: define "go for debian"
[09:02:24] <_joe_> you want to build go the wrong way as stapleberg decided, or respect the language and do what it's supposed to do?
[09:02:34] <_joe_> I guess the former given the issue you're having
[09:03:10] <_joe_> Emperor: if they pinned a "go src package" to a specific version of a library that's incompatible with your software, TOUGH LUCK
[09:03:17] <_joe_> you can't build that package in debian
[09:03:28] <_joe_> unless you create a separate package for the new library version
[09:04:02] <_joe_> Emperor: anyways, I'm happy to take a look
[09:04:49] <_joe_> just be warned it will include rants about how debian really missed the point with all languages that produce statically linked binaries
[09:43:54] :) thanks, I have a couple more Bad Ideas to try first, but when they don't work I'll be in touch
[09:57:10] ο/. Can anyone think of an example similar to Chubby ( https://sre.google/sre-book/service-level-objectives/ ) in our history ?
[09:57:47] that is people over relying on something because no clear expectations were ever set
[09:59:17] <_joe_> akosiaris: etcd is constantly at risk ofc :)
[09:59:44] <_joe_> but I've been guarding that with fervor
[10:00:17] hmmm confd comes to mind
[10:00:33] we "had" an incident recently didn't we ?
[10:00:46] and by had, I mean I helped trigger it
[10:01:42] <_joe_> well the incident wasn't an outage exactly because we've made things so that we don't rely on etcd too much
[10:01:51] <_joe_> confd is part of the "etcd ecosystem" if you want
[10:02:30] <_joe_> but yes, again, I see that as the best potential example, besides people flooding the action api without regard for concurrency
[10:02:57] This incident? https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd
[10:02:59] <_joe_> think of the restbase-induced outages when a page on parsoid-js would make 100s of api calls to render the lua fragments
[10:03:37] <_joe_> btullis: the outcome of that incident shows we *correctly* don't rely on etcd being available too much
[10:04:08] <_joe_> basically the idea is that the site needs to work as designed even if etcd is down hard
[10:04:50] _joe_: Agree. The only other confd etcd related incident that I remember was mostly noise and mopping up, but not much impact to services.
[10:05:51] (recent incident)
[10:06:03] <_joe_> btullis: I repeatedly said we should make etcd unavailable for some time every quarter to ensure nothing has a hard dependency
[10:06:17] <_joe_> but lately we've done a good job of bringing it down ourselves :)
[10:06:35] yes we should. It is meeting its SLO with a very large margin IIRC
[10:07:11] https://grafana.wikimedia.org/d/slo-etcd-tmpl/etcd-slos-grizzly-template?orgId=1
[10:07:25] <_joe_> yes, and IIRC we've taken care of making it large enough to exceed what we can accept for mediawiki
[10:07:28] 99.99% and 100%
[10:07:33] when target is 99.9%
[10:07:47] Latency though we are interestingly failing!!!
[10:08:20] <_joe_> are we?
[10:08:26] oh, wait that's the error budget left
[10:08:36] I need to learn how to read panel titles apparently
[10:08:37] how are you measuring latency btw?
[10:08:42] internal process time? TTFB?
[10:08:54] <_joe_> vgutierrez: d20 throw
[10:08:59] vgutierrez: internal etcd metrics
[10:09:01] _joe_: :P
[10:09:05] _joe_: i'm more a d100 guy
[10:09:08] <_joe_> akosiaris: I have seen the code
[10:09:14] <_joe_> I know what I'm saying :P
[10:09:18] Call of Cthulhu FTW... you damn D&D fanboy
[10:09:29] so yeah, we have used like... 2.9% of our budget?
[10:09:33] <_joe_> vgutierrez: never played D&D
[10:09:35] we are over performing by a lot
[10:09:39] <_joe_> well not outside of the WMF
[10:09:51] akosiaris: that seems pretty healthy to me BTW
[10:10:05] <_joe_> vgutierrez: meh, for etcd it can be counterproductive
[10:10:21] <_joe_> it might mean that developers get used to it being more available than we're promising
[10:10:28] <_joe_> creating hard dependencies
[10:10:30] vgutierrez: we shouldn't. That's the point. We should either introduce synthetic failures or tighten the SLO
[10:10:36] <_joe_> and given how omnipresent etcd is
[10:10:43] <_joe_> akosiaris: the former
[10:10:45] or a combination
[10:10:56] <_joe_> no I am radically against tightening the SLO
[10:11:07] <_joe_> we need etcd to be loosely coupled to the infra
[10:11:38] <_joe_> one way to ensure it is to make it less available than what the components depending on it would need if it was tightly coupled
[10:11:48] that's an interesting approach
[10:11:56] <_joe_> vgutierrez: not my invention heh
[10:12:03] <_joe_> it's the problem of all systems like etcd
[10:12:16] so providing a bad SLO to avoid consumers relying too much on it
[10:12:21] "bad"
[10:12:37] vgutierrez: yeah that chubby planned outage example
[10:12:46] https://sre.google/sre-book/service-level-objectives/
[10:12:59] <_joe_> vgutierrez: etcd is basically costco chubby
[10:13:16] <_joe_> and I didn't want to go through the same global disastrous outage as google did years ago
[10:13:22] It's pretty interesting as an idea. If people over-rely on something because it rarely fails, when it fails, things can fail bad
[10:13:37] <_joe_> akosiaris: oh I have one!
[10:13:50] <_joe_> but I won't say it on friday being oncall
[10:14:29] akosiaris: chaos monkey should address that as well..
[10:14:31] whisper it in my ear? :P
[10:14:49] <_joe_> akosiaris: DNS
[10:14:58] vgutierrez: oh, I can? haven't touched much of production in a while and I got honorary "Chaos Monkey" badge in phab
[10:15:02] for good reason :P
[10:15:05] LOL²
[10:20:43] speaking of confd, I need to update nginx on the conf servers next week, will sync up beforehand since we might hit that etcd-mirror crash again that we saw for the last nginx update
[10:34:21] <_joe_> moritzm: that is relatively simple to fix but it pages yes
[10:35:53] ack
[10:39:42] <_joe_> I mean you just restart etcd-mirror
[10:39:45] <_joe_> if it fails
[10:40:08] hello folks, just to avoid heart attacks, I messed up with the benthos config and from 10:12 -> 10:22 there was an ingestion of more webrequests than what we sample. The end result is that in turnilo/superset there is a big jump in requests, but it is nothing "real" (only Luca messing up). Apologies :)
[10:43:03] <_joe_> elukey: as long as it doesn't page :P
[10:45:02] not yet! :P
[10:56:02] elukey: add a hypertension pill emoji to the topic? :P
[10:56:07] * akosiaris just joking
[11:09:00] elukey: lol
[14:41:05] akosiaris: _joe_: to add some flavor to this, the particular thing they didn't want to be highly-available was 'global Chubby', which was a globally-replicated version of Chubby. (so yes, any writes would do Paxos between like 20 datacenters around the world, which was horrid.) most users actually used an automatically-updated snapshot of it in their local cell, but still that came with a lot of
[14:41:07] risks
[14:41:50] paxos in 20 DC globally...
[14:41:54] <_joe_> yeah
[14:42:06] <_joe_> and I cowardly refused to do raft between 2
[14:42:20] 👀
[14:44:16] <_joe_> akosiaris: want to degrade etcd's performance? we can go single-cluster
[14:44:24] <_joe_> across eqiad and codfw
[14:44:58] where is that whistling emoji when you want it...
[15:10:29] akosiaris: I think they allowed each service about one write per minute
[15:10:40] something like that
[15:10:54] and tried to strongly discourage you from using it at all lol
[15:25:47] <_joe_> 1 per minute lol
[18:45:26] so gerrit has this new feature where it searches and highlights text I hover over
[18:45:44] does anyone know where I can turn this off? I am looking under settings but can't seem to find it
[18:46:19] in a CR I meant
[18:47:47] sukhe: second to last under https://gerrit.wikimedia.org/r/settings/#Preferences, "disable token highlighting on hover"
[18:48:34] oh man thanks
[18:48:38] you are a lifesaver
[18:49:32] I was getting a headache
[18:50:39] 👍
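
Note on the error-budget arithmetic from the 10:06-10:10 exchange: a minimal sketch of how "99.99% measured against a 99.9% target" translates into budget consumed, and how much planned (synthetic) downtime would still fit in the remaining budget. The function names and the 90-day window are assumptions made for illustration; they are not taken from the Grafana dashboard or from any WMF tooling, and the 2.9% figure quoted in the log comes from the dashboard, not from this calculation.

```python
def budget_consumed(slo_target: float, measured: float) -> float:
    """Fraction of the error budget used, given availabilities in [0, 1]."""
    allowed_failure = 1.0 - slo_target      # e.g. 0.001 for a 99.9% target
    if allowed_failure <= 0:
        raise ValueError("a 100% target leaves no error budget")
    actual_failure = 1.0 - measured         # e.g. 0.0001 for 99.99% measured
    return actual_failure / allowed_failure


def remaining_downtime_minutes(slo_target: float, consumed: float,
                               window_days: int = 90) -> float:
    """Minutes of downtime still allowed in the window (hypothetical 90-day default)."""
    window_minutes = window_days * 24 * 60
    total_budget = (1.0 - slo_target) * window_minutes
    return total_budget * (1.0 - consumed)


# Numbers quoted in the chat: 99.9% target, 99.99% measured.
used = budget_consumed(0.999, 0.9999)
print(f"{used:.1%}")  # 10.0%

# With the 2.9% consumption read off the dashboard, a 90-day window would
# still leave roughly two hours of budget for a deliberate outage.
print(f"{remaining_downtime_minutes(0.999, 0.029):.1f} minutes")
```

The second function is the part relevant to the "make etcd unavailable for some time every quarter" idea: it bounds how long a planned outage can be while still meeting the published SLO.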