[07:23:36] _joe_: https://netbox.wikimedia.org/extras/reports/results/5017498/ " kubernetes2026 (WMF11797) Primary IPv6 missing DNS name"
[07:24:25] <_joe_> XioNoX: why are you telling me, that's dcops :D
[07:24:32] <_joe_> but yes, I can probably fix that
[07:25:24] you're more awake than DCops and at this point I guess the server has been handed over to the service owner
[07:58:16] docs at https://wikitech.wikimedia.org/wiki/DNS/Netbox#Add_missing_DNS_name_to_the_primary_IPv6_address if needed :)
[08:47:16] <_joe_> XioNoX: the servers won't be installed today, but at the earliest on thursday :)
[13:33:39] gentle bump re https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/860584 (free performance)
[13:42:37] <_joe_> mszabo: heh yyeah :/
[13:45:58] there was a nice CPU utilization drop when we rolled this out last October, unfortunately I think I lost the screenshot
[13:48:25] found the data https://usercontent.irccloud-cdn.com/file/sQIAlQGm/Screenshot%202023-09-19%20at%2015.47.58.png
[13:54:54] reminder: services + traffic switchover starting in 5min; if I die (or if you want to follow along), tmux will be `switchover` (under user kamila)
[13:55:31] * Emperor is going to stop restarting swift backends imminently (and will wait 'til the switchover is done)
[13:55:38] please attach in read only mode (see https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions )
[13:55:52] Emperor: thank you :D
[13:55:58] volans: thanks :-)
[13:56:27] kamila_: [done]
[13:56:39] ty :-)
[14:01:21] Traffic and services switchover starting people, let's go :)
[14:02:17] who's doing the doing?
[14:02:25] kamila_
[14:02:49] ok
[14:03:12] tmux is on cumin1001, user kamila tmux session named switchover
[14:03:17] and we are 4/60 right now
[14:03:49] I expect the services to go well because I accidentally did it last week :D
[14:04:06] and it was okay :D
[14:04:07] sudo cookbook -d sre.discovery.datacenter status all for anyone who wants to follow the blinkenlichten, same host
[14:04:23] we're depooling eqiad edge today too right?
[14:04:25] oh, put it in a watch -n 5
[14:04:31] bblack: yes
[14:04:58] also the tmux session scales to the smallest terminal attached, so someone just made it tiny
[14:05:04] mybad
[14:05:05] kamila_: If you don't want to get your window screwed up, I suggest setting the window-size to manual
[14:05:21] claime: I'm aware, but I don't really care that much
[14:05:25] ack
[14:05:59] (the only reason I have such a huge terminal is so that I'm not the bottleneck for people following :D)
[14:07:47] <_joe_> I would recommend keeping an eye on mw latency as we move things to codfw
[14:07:50] <_joe_> mw in eqiad
[14:09:04] _joe_: ack, thanks
[14:10:13] !incidents
[14:10:13] 4066 (ACKED) kafka-jumbo1015/Kafka Broker Server (paged)
[14:11:32] appservers rps switching between codfw and eqiad, it's the ro services being switched around
[14:13:04] latency taking a hit, but mean still around 150ms for eqiad api_appservers
[14:13:56] looks like just a spike, it's back to normal
[14:14:24] <_joe_> claime: the sesssionstore move will be what hits worst
[14:15:27] <_joe_> claime, kamila_ I'll be deploying mediawiki manually to k8s now
[14:15:36] ack
[14:15:39] ack
[14:15:44] <_joe_> it shouldn't impact you at all
[14:16:34] (so if it does, it will be in a confusing way... ;-) )
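
For anyone following along, the read-only attach and the status polling mentioned above (13:55:38, 14:04:07, 14:04:25, 14:05:05) look roughly like the sketch below. It uses plain tmux and the cookbook command quoted verbatim in the log; the shared-socket details for attaching to another user's session are on the Collaborative tmux sessions wikitech page, so the attach line here is a simplified assumption.

  # Attach read-only so a stray keystroke cannot interfere with the driver's session.
  # (For someone else's session you need the shared socket described on the wikitech page;
  # for your own session the plain form below is enough.)
  tmux attach-session -t switchover -r

  # Avoid resizing the shared session down to your terminal: keep the window size manual.
  tmux set-option -w window-size manual

  # Poll the discovery switchover status every 5 seconds, as suggested at 14:04:25.
  watch -n 5 'sudo cookbook -d sre.discovery.datacenter status all'
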
[14:17:00] <_joe_> uhm I might need to use scap, actually
[14:17:04] <_joe_> *sigh*
[14:17:56] 2/3 of the way there
[14:18:28] <_joe_> akosiaris: yeah I am mostly worried about the next part of the switchover, which is supposed to be the deployment server
[14:18:31] MediaWiki backend response times elevated but not critical
[14:18:37] <_joe_> and we need to amend the situation on mw on k8s now
[14:18:57] Do the mw-on-k8s before we switch the deployment server
[14:19:10] It shouldn't take too long for you to do that, we can release the scap lock for that time
[14:19:19] Then do the deployment server switch
[14:19:25] with mw-on-k8s in a known good state
[14:19:44] <_joe_> ack
[14:19:48] <_joe_> yeah let me do it
[14:20:04] <_joe_> kamila_: can you remove the scap lock?
[14:20:18] if everyone promises to be good :P
[14:20:30] done
[14:20:30] <_joe_> kamila_: I will use scap but it's absolutely safe(TM)
[14:20:45] <_joe_> oh we're at zotero?
[14:20:57] 46/60
[14:20:59] <_joe_> alex's favourite service
[14:21:03] 47 is api-gateway for some reason
[14:21:18] <_joe_> akosiaris: I think it's following the order in the service catalog
[14:21:21] <_joe_> :)
[14:21:29] maybe the cookbook should be ordering them alphabetically
[14:21:29] Right now on the most critical of all services, similar-users
[14:21:33] ahahahaha
[14:21:39] wasn't that undeployed?
[14:21:42] Not yet
[14:21:53] I don't think we reached consensus on the task
[14:21:56] ah, I 've seen the patches, probably not merged yet.
[14:22:09] ah maybe
[14:22:09] might as well be, probably
[14:22:26] because dns wiped zero records/packets, so nobody's looking up that hostname on a regular basis :)
[14:22:37] <_joe_> bblack: lol
[14:22:57] _joe_: given that you want to be deploying, should I do traffic or deployment server first?
[14:23:15] <_joe_> kamila_: I will be done in 1-2 minutes I think
[14:23:54] MediaWiki backend response times steady just under 1s
[14:24:18] <_joe_> claime: uh wat
[14:24:19] ?
[14:24:28] I see 619ms p95
[14:24:33] https://grafana.wikimedia.org/goto/d0cALQiIz?orgId=1
[14:24:35] This one
[14:24:36] > 4066 (ACKED) kafka-jumbo1015/Kafka Broker Server (paged)
[14:24:36] I just provisioned these brokers, they are not holding any data. Have I missed something pre-deployment that would have avoided paging?
[14:24:39] <_joe_> akosiaris: you're watching different graphs
[14:24:49] ah, ok
[14:25:04] <_joe_> brouberol: I am not sure, but stopping puppet on the icinga server would've avoided the page happening
[14:25:29] <_joe_> brouberol: I think the problem is that as soon as puppet has run on the host, icinga knows that host is a kafka broker on its next puppet run
[14:25:34] <_joe_> and will monitor it accordingly
[14:25:36] thumbor paged, acking
[14:25:36] the cookbook should have been silencing those IIRC
[14:25:50] !incidents
[14:25:51] 4067 (UNACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[14:25:51] 4066 (RESOLVED) kafka-jumbo1015/Kafka Broker Server (paged)
[14:26:02] <_joe_> jelto: ouch
[14:26:13] <_joe_> I think we're over capacity for thumbor
[14:26:18] <_joe_> we need to repool it in eqiad
[14:26:19] ouch
[14:26:30] there was no cookbook here, just me adding a broker host to a kafka role in puppet, and manually running the puppet agent on the host
[14:26:39] because we don't replicate thumbnails and it's having to generate a ton?
[14:26:42] ok to repool thumbor after the cookbook finishes?
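
The repooling discussed here goes through conftool's discovery objects. A minimal sketch follows; only the set/pooled=true form that _joe_ pastes just below (14:27:35) is quoted from the log, while the get action and the per-datacenter name= selector are assumptions about confctl usage.

  # Show the current pooled state of the thumbor discovery record
  # (assumes confctl's 'get' action; run from a cluster management host).
  sudo confctl --object-type discovery select 'dnsdisc=thumbor' get

  # Repool it; adding name=eqiad to limit the change to a single datacenter is an
  # assumption, the bare form below matches what is quoted in the log.
  sudo confctl --object-type discovery select 'dnsdisc=thumbor,name=eqiad' set/pooled=true
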
[14:26:47] kamila_: yes
[14:26:50] thanks
[14:26:51] kamila_: yup
[14:27:14] stage is: k8s-ingress-wikikube-rw.discovery.wmnet is only pooled in eqiad: skip or move to codfw?
[14:27:35] <_joe_> kamila_: sudo confctl --object-type discovery select 'dnsdisc=thumbor' set/pooled=true
[14:27:36] brouberol: there's a hiera key you can set for a whole host to disable the usual alerts, that'd be one way
[14:27:38] (not replicating thumbs seems like it will always be problematic on switchover, tbh)
[14:27:43] <_joe_> I would do it now
[14:27:51] Can the broker discussion move elsewhere please ?
[14:27:56] <_joe_> bblack: I really don't understand why we stopped
[14:28:00] <_joe_> anyways
[14:28:03] cdanis claime: ack
[14:28:04] _joe_: kamila_: repooling thumbor
[14:28:11] <_joe_> wait
[14:28:11] ack, thanks
[14:28:13] me either, but I assume it was some pragmatic decision about some kind of load
[14:28:14] <_joe_> thumbor is pooled
[14:28:19] huh wat
[14:28:29] we didn't already depool eqiad edge, correct?
[14:28:31] <_joe_> ah wait
[14:28:36] <_joe_> it's under ingress now?
[14:28:38] cdanis: correct
[14:28:39] cdanis: afaik it's still pooled
[14:29:00] ok cookbook finished
[14:29:17] <_joe_> kamila_: ok now we can try to understand what's going on with thumbor :)
[14:29:22] unsure what to do about thumbor
[14:29:22] taht
[14:29:40] should I proceed with deployment server or not yet?
[14:29:45] _joe_: It's not under ingress afaict
[14:29:52] also, _joe_, can I put the scap lock back?
[14:30:04] <_joe_> kamila_: go on
[14:30:09] thanks
[14:30:20] <_joe_> so, thumbor is called directly by swift
[14:30:28] <_joe_> so we need to repool swift in eqiad I guess
[14:30:34] <_joe_> or add replicas to thumbor in codfw
[14:31:12] I don't see how we can add replicas given we're already resource constrained
[14:31:17] <_joe_> yeah
[14:31:22] yeah, that
[14:31:25] <_joe_> so let's rollback the switch of swift
[14:31:29] ack
[14:31:33] can we manually sync thumbnails?
[14:31:34] <_joe_> and we will do it after we added servers in codfw
[14:31:43] _joe_: I think that also means we can't depool eqiad edge
[14:31:48] which swift ?
[14:31:50] <_joe_> cdanis: exactly
[14:31:51] all of them?
[14:31:54] because that will again effectively depool swift @ eqiad, and then,
[14:31:58] yes
[14:32:02] swift, swift-ro, swift-ew ?
[14:32:03] <_joe_> cdanis: yeah
[14:32:10] rw*
[14:32:13] swift-ew... lol
[14:32:16] <_joe_> all of those
[14:33:06] <_joe_> cdanis: well the eqiad edge is a fraction of the traffic of eqiad's swift
[14:33:12] thumbs> because the horrible mess that is the wmf-specific swift rewrite middleware was causing service outages because it would spawn a new thread for every thumbnail it was trying to chuck at the other DC
[14:33:14] <_joe_> so I think we can
[14:33:46] err swift-rw is A/P
[14:33:51] So I can't repool it in eqiad
[14:34:10] <_joe_> claime: star with the rest
[14:34:14] swift and swift-ro repooled though
[14:34:35] <_joe_> ok
[14:34:39] we did depool edge eqiad it last time and for weeks in a row and I don't remember having such an issue. And IIRC we ditched the double writing already
[14:34:40] given we haven't switched off the eqiad edge, I assume the induced codfw thumbor load was due to swift-rw traffic from mw/services? or?
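
A quick way to confirm where each of the swift discovery records is currently being served is to resolve them directly. This assumes the dnsdisc names map to <name>.discovery.wmnet records, following the pattern of k8s-ingress-wikikube-rw.discovery.wmnet mentioned above.

  # Resolve each swift discovery record and note which site's VIP it returns.
  for svc in swift swift-ro swift-rw; do
    printf '%s: ' "$svc"
    dig +short "${svc}.discovery.wmnet"
  done
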
[14:35:14] maybe it's -ro though
[14:35:15] <_joe_> bblack: no it was from the eqiad edge (and esams, and eqsin) reaching codfw's swift
[14:35:31] ah right, ok
[14:36:04] <_joe_> pooling swift-rw as well
[14:36:12] thanks
[14:36:15] <_joe_> can someone check thumbor in the meantime?
[14:36:32] akosiaris: yeah, I think we stopped thumb "replication" ages ago, I'll check
[14:36:37] 5xx going down on thumbor codfw
[14:36:38] currently, thumbor in codfw has 16 replicas less than in eqiad
[14:36:45] ouch
[14:36:47] <_joe_> wat
[14:36:48] that could explain why it's not enough
[14:36:49] <_joe_> sigh
[14:36:51] yeah
[14:36:52] <_joe_> yes
[14:36:56] ok, let's solve this first
[14:36:58] that would do it
[14:37:02] <_joe_> akosiaris: how?
[14:37:12] what are we constrained on there?
[14:37:18] we have more capacity in eqiad
[14:37:19] Date: Mon Jul 25 11:47:26 2022 +0100 <-- when we stopped thumb "replication"
[14:37:30] 2 less nodes in codfw
[14:37:45] is the air-quotes becuase we were just triggering thumbor generation on the other side, rather than copying the output?
[14:38:26] <_joe_> bblack: so turns out the problem is thumbor capacity in codfw
[14:38:33] why do we have unequal capacity?
[14:38:44] <_joe_> bblack: the person who could answer in on PTO
[14:38:47] ok
[14:38:55] <_joe_> but I guess that was part of adapting during migration
[14:39:16] bblack: more because actual replication would involve checking for success or somesuch (and maybe also syncing deletions), rather than just "chuck it at the other DC and hope"
[14:39:19] <_joe_> but now I would propose we move swift on thursday after we've expanded the k8s cluster in codfw
[14:39:40] makes sense
[14:39:45] <_joe_> ok, please move any non-strictly-switchover discussion (like: thumbs replication) elsewhere please
[14:40:02] codfw 5xx thumbor rates now ok
[14:40:07] does that mean I should not touch traffic until thumbor is dealt with?
[14:40:19] [I don't think anything else on thumbs needs saying now, but if so, find me in #wikimedia-data-persistence]
[14:40:21] * akosiaris doing a thumbor pods and k8s nodes comparison
[14:40:28] !incidents
[14:40:28] 4067 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[14:40:28] 4066 (RESOLVED) kafka-jumbo1015/Kafka Broker Server (paged)
[14:41:13] Sorry for the page jelto
[14:41:28] <_joe_> kamila_: no you should be able to move traffic away from eqiad tbh
[14:41:40] yeah either way, both DCs are using discovery to reach swift
[14:41:49] ok, thanks
[14:41:49] (from the traffic edge I mean)
[14:42:08] <_joe_> I don't think it's gonna be a huge huge impact
[14:42:46] <_joe_> btw I have meetings for the next 1.5 hours
[14:43:58] ok where are we at? traffic and deployment server left right?
[14:44:08] yes, almost done with deployment server
[14:44:12] ack
[14:44:39] <_joe_> kamila_: oh ok :)
[14:45:09] <_joe_> btw the main reason we do the switchover is to expose such asymmetries
[14:45:27] yup
[14:45:27] <_joe_> and/or when we don't have capacity to serve everything from a single DC
[14:46:13] task failed successfully
[14:46:19] yay :D
[14:47:10] <_joe_> cdanis: don't mock my pep talk please :P
[14:47:25] I was agreeing with it, not mocking it
[14:47:39] kamila_: Normal, I think that job is on the secondary
[14:47:41] Let me cehck
[14:47:43] check*
[14:47:46] oh
[14:47:59] <_joe_> what happened?
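
A sketch of the pods-and-nodes comparison akosiaris starts at 14:40:21, for the record. The namespace name and the per-cluster kubeconfig paths are assumptions for illustration, not confirmed by the log.

  # Compare thumbor replica counts between the two clusters.
  kubectl --kubeconfig /etc/kubernetes/thumbor-eqiad.config -n thumbor get deployments
  kubectl --kubeconfig /etc/kubernetes/thumbor-codfw.config -n thumbor get deployments

  # Rough per-node headroom on the codfw cluster: CPU/memory requested vs allocatable.
  kubectl --kubeconfig /etc/kubernetes/thumbor-codfw.config describe nodes \
    | grep -A 6 'Allocated resources'
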
[14:48:11] Cronjob check sudo cumin deploy2002.codfw.wmnet 'systemctl list-units | grep -A1 sync_deployment_dir' fails
[14:48:34] well it's not on either
[14:48:54] Hmm
[14:48:56] I'll re-run puppet and hope XD
[14:50:18] so, we got the memory to startup more thumbor pods in codfw. Most nodes are at ~30% memory use with the highest on at 52%
[14:50:39] and CPU wise the codfw wikikube cluster is at 26% currently
[14:50:50] so, we definetely got the CPU too
[14:51:06] we already trick the scheduler for thumbor anyway while we wait for the new hardware to be racked
[14:51:16] akosiaris: Yeah but what about the CPU requests? That's what causes deployment issues for mw
[14:51:54] claime: # TODO: This is a hack to trick the scheduler into starting more pods than we
[14:51:54] # currently afford. Once we have more kubernetes hosts, we 'll remove
[14:51:54] requests:
[14:51:54] cpu: 100m
[14:51:54] memory: 100Mi
[14:52:01] ;-)
[14:52:04] a'ight
[14:52:13] that's what I alluded to when I said we trick the scheduler already
[14:52:22] uh...
[14:52:24] we only do that for thumbor though
[14:52:32] and we need to undo that once we got the hardware
[14:52:43] don't generalize the approach please ;-)
[14:52:58] still no cronjob on deploy*
[14:53:12] kamila_: there has to be something screwy with the modules/profile/manifests/mediawiki/deployment/server.pp L153
[14:53:20] yeah, apparently
[14:53:25] <_joe_> kamila_: let me take a look at the puppet code then
[14:53:31] thanks _joe_
[14:54:09] should I do traffic in the meantime or join the staring into the abyss?
[14:54:50] <_joe_> kamila_: what is the patch where you changed the deployment server?
[14:55:05] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/957736
[14:55:33] wait, race?
[14:55:41] when jayme says "uh..." and nothing else I get a bit concerned
[14:55:48] nope
[14:56:05] cdanis: I meant "uh...thats fine" :-D
[14:56:20] :D
[14:57:42] #include
[14:57:55] <_joe_> sigh I don't see a reason why the rsyncs wouldn't have been migrated
[14:58:38] the cronjob should be on 1001, correct?
[14:58:43] <_joe_> I still see https://phabricator.wikimedia.org/P52533
[14:58:53] 1002 but yeah
[14:59:04] <_joe_> and those are on 1001
[14:59:24] I 'll fix thumbor's asymmetry capacity in a few, let me know if you want me to delay the deployment
[15:00:05] <_joe_> uh kamila_
[15:00:07] <_joe_> systemd::timer::job { 'sync_deployment_dir':
[15:00:10] <_joe_> ensure => absent,
[15:00:13] <_joe_> no shit it's not there :D
[15:00:21] oh XD
[15:00:50] <_joe_> ok, I propose the following
[15:00:58] <_joe_> akosiaris: deploys the new thumbor capacity
[15:01:05] <_joe_> then we test scap from deploy2002
[15:01:18] <_joe_> I have a meeting now, I don't think I'm needed
[15:01:41] <_joe_> uhm wait
[15:02:05] * kamila_ fixing puppet
[15:03:09] * kamila_ actually not sure how exactly it should be fixed
[15:03:25] kamila_: Leave it alone for now, there's nothing to actually fix
[15:03:27] 48 thumbor pods running now in codfw
[15:03:29] It's dead code and old doc
[15:03:31] OK, thanks
[15:03:36] <_joe_> but yes, try a deployment. Foce a full image rebuild though
[15:03:43] <_joe_> claime: ^^
[15:04:18] _joe_: k8s only?
[15:04:37] <_joe_> claime: no, I'd try a full deployment tbh
[15:04:42] a'ight
[15:04:46] doing
[15:04:46] so sync-world?
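
Since the puppet resource above turned out to be ensure => absent, the "missing cronjob" is expected; a sketch of how to confirm it on both deployment hosts with systemd's timer listing follows. The hostnames are taken or inferred from the log (only deploy2002.codfw.wmnet is quoted explicitly), and the comma-separated cumin host list is an assumption about the query syntax.

  # Check both deployment hosts for the sync_deployment_dir timer; nothing should show
  # up until the puppet code changes, since the resource is declared absent.
  sudo cumin 'deploy1002.eqiad.wmnet,deploy2002.codfw.wmnet' \
      'systemctl list-timers --all | grep sync_deployment_dir || echo "timer not present"'
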
[15:04:52] oh, ok, thanks claime
[15:04:54] scap sync-world -D full_image_build:true
[15:04:56] <_joe_> claime: the issue being, the release git repos are not synched between the servers
[15:05:13] <_joe_> which si something we need to fix I think
[15:05:16] _joe_: they should be, the rsync quickdatacopies should do it
[15:05:39] <_joe_> claime: not /etc/helmfile-defaults/mediawiki/release
[15:05:49] ugh
[15:06:05] scap started
[15:06:19] <_joe_> claime: yeah, but OTOH this might not be an issue AFAICT, let's see
[15:06:35] why are all these RESTBase hosts complaining about the Salt article...
[15:06:50] <_joe_> akosiaris: not sure, and it's somewhat preoccupying
[15:07:04] <_joe_> let me try to check what restbase says in general
[15:07:09] 503ing to the openapi/swagger service-checker..it's a bit worrying
[15:07:13] <_joe_> they're all in eqiad, right?
[15:07:18] yes
[15:07:32] restbase is depooled in eqiad rn
[15:07:36] which is... weird, cause mw is still in eqiad
[15:07:52] ah wait, all just recovered
[15:08:09] aside from the feed/announcements thing which we know the problem
[15:11:19] <_joe_> akosiaris: there's an higher latency in general, and right now we have some problems with mw-web it seems
[15:11:38] isn't somewhat higher latency expected?
[15:11:59] <_joe_> kamila_: yes that is all good
[15:12:12] <_joe_> the mw-web thing though is somewhat worrisome
[15:13:44] mhm...
[15:14:35] That's a weight distribution issue between main and canaries
[15:16:01] <_joe_> I am not understanding the errors in restbase in eqiad, but given it's depooled from everything, I think we can move on
[15:16:05] <_joe_> claime: how's scap doing?
[15:16:18] It's building and pushing GB
[15:16:21] Let him cook
[15:17:29] Build and push done, it's now pulling the image on the k8s nodes
[15:17:43] okay stupid question but if I don't ask I'll stay stupid: which mw-web thing? which graph are you looking at?
[15:18:27] I really need to do a mw-on-k8s primer for everybody
[15:18:34] claime: yes :D
[15:18:48] mw-web is the mw-on-k8s deployment that serves non-api uncached user requests
[15:19:08] https://grafana.wikimedia.org/goto/6MFfUQiSk?orgId=1
[15:19:10] Dash
[15:20:41] okay, that's what I was looking at, but what is the issue, the latency spike?
[15:21:35] <_joe_> kamila_: just that the canary host was shortly hosed
[15:21:53] <_joe_> there was an alert in #-operations about having too few free workers
[15:22:02] s/host/release/
[15:22:09] 2 pods out of the 14
[15:22:30] kamila_: The issue is the Apache workers saturation
[15:22:38] top-right graph
[15:23:10] why though ?
[15:23:10] <_joe_> claime: s/apache/php/
[15:23:10] (this would be a lot easier if grafana stopped crashing in firefox -_-)
[15:23:14] that's what I don't get
[15:23:22] <_joe_> kamila_: disable hw acceleration
[15:23:57] Because it has less leeway during a rolling release than main, you kill one pod, only one left to take the load
[15:24:07] okay, I see it now :D
[15:24:13] thanks (and sorry)
[15:24:26] <_joe_> of what?
[15:24:28] Sorry for what?
[15:24:29] <_joe_> :)
[15:24:32] Asking questions lmao?
[15:24:34] <_joe_> lol
[15:24:44] <_joe_> you should be sorry for saying you're sorry!
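
The full deployment being run above is the quoted scap sync-world -D full_image_build:true; while it rolls the canary release, the pod churn behind the worker-saturation discussion can be watched directly on the cluster. This is only a sketch: the namespace and the release label below are assumptions for illustration.

  # Watch the mw-web canary pods cycle during the rolling release; with only a couple of
  # canary pods, taking one down concentrates traffic on the survivor, which is the
  # php-fpm worker saturation discussed above.
  kubectl -n mw-web get pods -l release=canary --watch
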
[15:24:49] XD
[15:24:50] <_joe_> :D
[15:24:52] okay okay :D
[15:25:03] <_joe_> jokes aside, I think once scap has finished we can do the traffic layer
[15:25:10] (kamila and i may need apologizers anonymous)
[15:25:10] yup
[15:25:24] (oh hi ihurbain, yes XD)
[15:27:52] scap deploy for canaries on codfw taking their sweet time, I'm afraid we're going to run into issues agfain
[15:28:45] <_joe_> that's not great
[15:29:10] I can't wait for those servers to be in the cluster
[15:29:51] We should de-deploy mw-jobrunner
[15:29:54] It's useless
[15:30:49] sorry to ask, what is the state of switch, is everything done except swift?
[15:31:07] Still fixing some stuff before doing the traffic switchover
[15:31:13] ah, sorry
[15:31:18] kubernetes throwing us curveballs
[15:31:20] I got distracted for a second
[15:31:34] will go back to shadows
[15:31:48] claime: quotas? or no requests available ?
[15:32:15] 66m Warning FailedScheduling pod/mw-web.codfw.main-5cdc5dd76d-zpxtj 0/24 nodes are available: 18 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate.
[15:32:23] -_-
[15:32:24] requests
[15:32:50] <_joe_> claime: the rest of scap is working well?
[15:33:08] It's rolling back canaries on k8s rn
[15:33:17] So we'll see for the bare metal part in a bit
[15:33:17] <_joe_> le sigh
[15:46:11] I am doing a quick audit of a few services to get some resources back
[15:46:26] akosiaris: So holding off on scaling back main?
[15:46:40] yeah, gimme 10 mins
[15:46:45] ack
[15:53:17] would it be a terrible idea to switch traffic already?
[15:57:19] it would probably be fine on a technical level, but if you've still got other issues going on, it might make things more confusing.
[15:57:34] yeah, fair, thanks
[15:59:06] <_joe_> I think you're clear kamila_
[15:59:15] <_joe_> claime: answered here :D
[15:59:18] yeah
[15:59:20] thanks
[15:59:21] Just saw :D
[15:59:27] Go forth kamila_ !
[16:00:04] and multiply?
[16:00:30] And switchover
[16:00:40] You can multiply if you want but that's not the point right now
[16:00:43] :p
[16:01:06] :D
[16:06:41] traffic switched over, now staring at graphs
[16:07:54] https://grafana.wikimedia.org/goto/q7BtuQmIk?orgId=1
[16:09:16] oh that one's pretty :D
[16:09:52] https://grafana.wikimedia.org/goto/w1WJXQmIk?orgId=1
[16:10:44] yeah, I was looking at that one
[16:13:31] nice
[16:16:35] I think we are now pretty ok ?
[16:16:42] ready for the big one tomorrow?
[16:17:40] do I have a choice? :D
[16:18:04] ahaha, always :-)
[16:18:09] Someone familiar with startupregistrystats maintenance script ?
[16:18:39] no, but I 'll look at the imagecatalog one that I 've seen error out in the previous switchover too
[16:19:08] TypeError: Return value of BlameStartupRegistry::getInternalStartupJs() must be of the type string, array returned
[16:19:14] Great. That tells me a lot.
[16:20:11] extensions/WikimediaMaintenance/blameStartupRegistry.php ? what is this
[16:20:11] This one's been failing since at least 1200 UTC
[16:20:17] So not the switchover
[16:20:39] migrated to periodic_job in 2020?
[16:20:50] oh, during covid, that explains I have 0 memory of it
[16:21:58] okay, traffic has been switched for a while, I don't see elevated error rates at the edge, so either I'm looking at the wrong place or we're good :D
[16:23:14] LGTM
[16:23:49] 🎉
[16:26:14] 🎆
[16:27:24] great job!
[16:27:29] nice!
[16:29:03] gg kamila_ !
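
To dig into scheduling failures like the FailedScheduling event pasted at 15:32:15, standard kubectl queries are enough; the namespace below is an assumption.

  # List recent scheduling failures for the namespace, newest last.
  kubectl -n mw-web get events --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp

  # See which nodes carry the taints the event complains about
  # (node-role.kubernetes.io/master, dedicated=kask).
  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
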
[16:29:17] y'all did the hard part :D
[16:29:39] I was just copypasting :D
[16:29:56] (but I was copypasting in prod and didn't break everything, so there is that :D)
[16:30:11] thanks for the help <3
[16:35:31] Things broke, but it wasn't your fault, and that's great :D
[16:39:28] things breaking is exactly why we do it regularly
[16:42:22] indeed- lots of changes since last time IMHO
[16:42:35] let me see when it was the last one
[16:42:41] March
[16:42:56] wow, I would have sweared it was 2 years ago!
[16:42:56] (switchback)
[16:43:04] lol no
[16:43:18] Great work. Well done kamila_
[16:43:38] I've only been here for 11 months, and I did a switchover and switchback. Now kamila_ has done the first part of their rite of passage into the ServiceOps team
[16:44:18] Almost ready to be sacrificed to the SLO gods^W^W^W^W^Won call
[16:44:39] Eeep :-D
[16:45:01] What am I saying 11, 13 months
[16:45:59] Hope we can get that sacrifice scheduled soon, the gods have quite the appetite
[16:46:59] * kamila_ takes a few months off
[16:47:08] x)
[16:47:16] There have been some changes indeed, even in the 6 months, I'll make an attempt to make a list, or at least update the docs
[17:43:12] Is the switchover complete to the point we can merge things for installs or should i hold off?
[17:43:26] (not rushing anyone can wait if needed)
[18:07:40] <_joe_> robh: there was no reason to stop IMHO
[18:10:28] cool
[18:10:31] thx for feedback!
[19:06:23] Good job, kamila_
[20:44:42] I'm looking for grafana panels I can crib; Does anyone have any suggestions for table panels we have in use?