[09:30:07] jayme: I'm seeing a certmanager-related networkpolicy admin_ng change in dse-k8s-eqiad (probably linked to fabd5d0d). Is that safe to deploy?
[09:30:58] effie: ^
[09:36:13] it seems this is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895696
[09:37:01] yes. effie is currently rolling out the change - IIUC tests in staging have been fine(?) so you're probably good...but I'd wait for her call
[09:37:18] sg, thanks
[10:13:20] brouberol: do not apply anything until we debug or roll back
[10:13:39] gotcha, I'm glad I asked
[10:13:50] * brouberol steps away from keyboard
[10:43:25] arnoldokoth: when running the sre.dns.netbox cookbook (or any other that calls it at some point), please carefully check the diff before proceeding. If there are spurious diffs related to other hosts, make sure to check with the owners of those changes/hosts before proceeding.
[10:44:30] in the last merge there were spurious changes related to dns7001 that I think might be OK to merge, but I can't be 100% sure. They were added last night by Ca.thal as part of the magru work. Not sure why they weren't merged (and he's off today)
[10:57:03] also ntp wasn't restarted
[10:57:34] jynus: yes, thanks, we will do that after reimaging today
[10:58:48] volans: Sure. Will make sure to check first next time.
[10:59:01] thanks
[11:09:29] brouberol: I think we have reverted to the previous state
[11:09:56] you shouldn't have any changes no
[11:09:57] now*
[11:24:19] the dns7001 change seemed to have been adding A and AAAA records, so we are good there at least
[11:30:48] Thanks effie, will do!
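As a concrete illustration of the careful run volans is asking for, a minimal sketch follows; the exact cookbook arguments and the commit message are hypothetical, and the point is simply to stop at the diff prompt:

    # hypothetical invocation from a cumin host; exact flags/arguments may differ
    sudo cookbook sre.dns.netbox "sync DNS records from Netbox"
    # the cookbook generates a DNS diff from Netbox data and prompts before committing.
    # if the diff includes hosts you did not touch (like dns7001 here), answer no
    # and check with the owners of those changes/hosts before re-running.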
[14:50:01] hi, there's a maintenance script throwing a big volume of errors that has been running for several hours now
[14:50:08] https://logstash.wikimedia.org/goto/b5da5670c0af8f81a3f537057b944c3f
[14:50:15] all errors with the same reqId
[14:50:24] does anyone know who to ping about this so they can take a look?
[14:57:42] ...
[14:58:47] jnuche: I think if that's migrateLinksTable then Amir1 will know, see https://phabricator.wikimedia.org/T345733
[14:59:25] let me take a look
[14:59:31] I looked at it in the morning
[14:59:47] thanks folks
[15:00:04] restarted it
[15:00:11] let's see if that fixes the issue
[15:00:25] it looks like an issue with old config being stored
[15:02:39] jnuche: fixed now
[15:03:31] Amir1: yeah, log errors have stopped, thanks!
[15:03:40] \o/
[15:58:48] datahub-mae-consumer-main in kube staging has been spamming logstash pretty hard and causing lag. brouberol, btullis, would you mind taking a look at what's wrong?
[15:58:58] I'm looking at this https://logstash.wikimedia.org/goto/4924b8a5f441cffa15b345296d7fa3de
[15:59:08] or anyone else familiar with the system FWIW
[15:59:12] godog: Will do.
[15:59:38] thank you btullis!
[16:04:16] the error that I see is the following: https://phabricator.wikimedia.org/P61497 (from pod logs)
[16:05:12] Yup, I created https://phabricator.wikimedia.org/T363843 to track it. If necessary we can remove the deployment, but I'd rather see if we can skip the message.
[16:05:55] ack super, let us know if you need help
[16:06:34] Thanks, I can help from the o11y side.
[16:06:51] Cheers both.
[16:16:19] I think it is fixed. Monitoring for stability now.
[16:17:04] btullis: Thank you!
[16:17:43] btullis: sweet! thank you, I can confirm the lag is decreasing
[16:34:06] just a reminder that the etcd work for T358636 will start in ~30m. all coordination will be in -operations. thanks!
[16:34:07] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[16:34:31] thanks swfrench-wmf! and gl :)
[16:34:34] swfrench-wmf: thanks! we are running a few cookbooks over in Traffic but promise to stop by then. gl!
[16:40:48] Sorry, I was afk on baby duty
[16:47:08] in about an hour I have to run to a ~1h appt, will bring the laptop along but may be delayed
[16:49:21] herron: I can cover you.
[16:49:39] thanks!
[16:55:57] swfrench-wmf: we have stopped all Traffic cookbooks
[16:56:00] all yours, gl :)
[17:00:18] thank you all :)
[17:34:15] does anyone know the relationship between LVS hosts and the service catalog (hieradata/common/service.yaml in puppet)? Like if I classify my service as "low traffic," does it go to a certain set of LVS hosts, or how does that work?
[17:35:44] you add it to hieradata/common/services.yaml
[17:35:53] inflatador: modules/profile/manifests/lvs/configuration.pp is the other side of the puzzle
[17:35:55] -s
[17:36:02] 'high-traffic1' => $::realm ? {
[17:36:02] 'production' => $::site ? {
[17:36:02] 'eqiad' => [ 'lvs1017', 'lvs1020' ],
[17:36:06] ^ those parts
[17:36:36] so high-traffic1 services in eqiad use lvs1017 and lvs1020
[17:36:57] the general layout is: the first one in that array is where the traffic normally routes through, and the second one is a backup (generally a shared backup for multiple classes)
[17:37:53] bblack sukhe ACK, thanks. Working on T363702 and trying to suss out where the pools come from
[17:37:54] T363702: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702
[17:38:03] note that high-traffic[12] and low-traffic are just old labels. they don't necessarily have great semantic meaning anymore
[17:38:08] low-traffic has quite a lot of traffic :)
[17:38:45] Gotta love organic growth ;P
[17:38:53] a better set of names in the present would be something like: "text-public", "media-public", and "private"
[17:39:32] or maybe "internal" instead of private
[17:39:58] but renaming things is hard, and they're embedded and documented everywhere, etc
[17:40:27] maybe there will be a good chance to fix it during the liberica transitions next FY!
[17:40:46] * inflatador was going to timidly suggest that
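A quick way to trace both halves of the mapping bblack describes, as a sketch; the grep patterns and the `class:` key name are assumptions about the repo layout rather than verified paths:

    # in a checkout of operations/puppet:
    # 1) services declare their LVS class in the service catalog
    grep -n 'class: low-traffic' hieradata/common/service.yaml | head
    # 2) the class -> LVS host mapping is in the manifest quoted above
    grep -n -A 3 "'high-traffic1'" modules/profile/manifests/lvs/configuration.pp
    # e.g. 'eqiad' => [ 'lvs1017', 'lvs1020' ]: primary first, shared backup second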
[18:11:12] was someone working on lists1004?
[18:11:15] pending DNS changes
[18:11:32] sukhe: arnoldokoth: ^
[18:12:09] thanks mutante. arnoldokoth: ok to merge those? because they will block anyone else running the cookbook
[18:12:21] eoghan: ^
[18:12:37] volans raised it with eoghan a bit ago in -operations
[18:12:50] ah ok
[18:13:01] I think I'm going to be a bit bold and just merge this
[18:13:08] because otherwise it blocks everyone else
[18:13:16] sukhe: Yeah.
[18:13:24] arnoldokoth: thanks!
[18:13:57] We were actually waiting for the etcd work to be completed.
[18:14:14] ah fair
[18:14:17] yeah it's done
[18:14:56] Sure. Will now proceed.
[18:22:57] bblack: please please please let's fix the naming
[18:23:57] swfrench-wmf: thanks again!
[18:24:45] do I understand correctly that all keyspaces (except one) are now replicated? that's nice
[18:25:40] thanks, cdanis! many thanks to v.olans and rzl for the reviews and / or second pair of eyes / hands :)
[18:28:02] exactly, yeah: rather than just /conftool, it's the entire / keyspace now, with one notable exception - the path where python-etcd locks are stored for spicerack (i.e., the locks that provide mutual exclusion for the long-lived spicerack locks).
[18:28:42] those aren't meaningful across clusters, as they depend on the etcd cluster index
[18:29:34] swfrench-wmf: on the topic of etcd, we have another patch that I wanted to merge: https://gerrit.wikimedia.org/r/c/operations/dns/+/1025800
[18:29:49] basically, when we were copying over stuff for magru, we put them under the esams origin
[18:30:07] so we need to fix that, and this involves running a restart of confd
[18:30:12] should be fine but I wanted to check with you
[18:33:03] ah, interesting! taking a look ...
[18:35:25] so, for the past 2w, _etcd._tcp.esams.wmnet has resolved to ... what?
[18:35:38] yeah, that's the issue
[18:35:45] we caught it today
[18:36:01] when we were depooling something in esams
[18:36:50] got it, yeah this LGTM in terms of putting things back in the right state
[18:37:03] thanks, deploying now
[18:38:12] not 100% sure if the confd restart is needed (I know it doesn't re-resolve in the successful resolution case, but no idea in the nxdomain case)
[18:38:55] we will clear the cache anyway but I guess the danger emoji is a good idea to restart it :P
[18:39:22] that's exactly what I was going to say, yeah - better to go with the paved path here and restart in all of esams
[18:42:16] sukhe@cumin1002:~$ sudo cumin 'C:confd' 'systemctl restart confd'
[18:42:17] 2104 hosts will be targeted:
[18:42:18] that's fun
[18:42:38] today might be the day I break it
[18:42:43] oh, I don't think you need to do it globally
[18:43:39] so, it's really only going to be for the clients that use the esams.wmnet records, which as long as we've not done anything silly anywhere, should just be in esams
[18:44:19] "couldn't hurt" so to speak to do it globally, but might not be strictly necessary
[18:44:31] ok :)
[18:45:53] sukhe@cumin1002:~$ sudo cumin 'C:confd and *.esams.wmnet' 'systemctl restart confd'
[18:45:56] 26 hosts will be targeted:
[18:47:35] that's a lot fewer :)
[18:47:45] I did -b1 and -s10 and am rolling out. thanks!
[18:52:23] nice to see etcd fully replicated now, thanks a lot!
[19:44:10] herron: I merged 'alertmanager: irc: remove second space'
[19:44:17] andrewbogott: thanks!
[20:38:29] fyi, I need to run an errand at 21:00 UTC (in ~25m). I don't expect anything weird to happen at this point, but in an abundance of caution: if anything smells like etcd, ping / mention me (I'll have my phone and laptop).
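Tying together the etcd SRV fix and the confd rollout from 18:35-18:47 above, a sketch of the check-then-restart sequence; the dig call and cumin's -b/-s batching flags are standard, and the host selector is taken from the log:

    # confirm the SRV record resolves again after the DNS patch
    dig +short SRV _etcd._tcp.esams.wmnet
    # restart confd on the esams hosts one at a time, pausing 10s between batches
    sudo cumin -b1 -s10 'C:confd and *.esams.wmnet' 'systemctl restart confd'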