[08:25:00] elukey: slyngs: re: ganeti.reimage: I already did like 8 reimages with the cookbook plus the additions I've noted in the CR (manually changing the DHCP config beforehand). AIUI the DHCP thing (e.g. selecting the OS to use for reimage) is the only blocker (as it simply does not work this way)
[08:27:34] jayme: Yeah, I talked about that with moritzm and there's a plan involving removing the OS image information from the Puppet DHCP config... I might be remembering that wrong, I'll follow up
[08:28:23] But thank you for doing the testing :-)
[08:29:44] np. IMHO you could just drop the option from the cookbook for now and instruct the user to edit the dhcp config
[08:30:05] it makes sense anyway because it should be persisted to the repo (at least after the reimage)
[08:30:32] if that makes it easier (code-wise) and the cookbook immediately useful, I'd say it's a win :)
[08:31:21] That's a pretty good idea, we can always add it back if we want to change the flow in the future
[08:31:37] indeed
[08:34:17] jayme: o/ yes yes I figured after some tries, I basically copied your version of the ganeti cookbook on cumin1001 to my deployment of the repo :)
[09:46:44] no please, the DHCP stuff has been ready in spicerack forever, let's not do frankensteins
[09:47:06] no more need to store MACs in the repo at all
[09:47:08] it's all automatic
[09:50:12] also please don't run non-reviewed/non-merged cookbooks with SAL disabled in production, unless it's in DRY-RUN mode or for the very first run before merging but *after* the code reviews, just to spot silly errors
[10:14:26] volans: Not sure I understand. Are you saying that all MACs in modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 are no longer required?
[10:15:20] they will no longer be required once we officially merge the cookbook and move to a dynamic DHCP snippet like the physical hosts, yes
[10:15:47] the two things need to happen at the same time
[10:15:56] ah, okay so that's the plan. That was not clear to me
[10:58:32] hmmm the ceph-quincy third-party apt component seems to be broken at the moment, btullis could you fix it please?
[10:58:59] https://www.irccloud.com/pastebin/tuPQr3sU/
[11:05:09] btullis, moritzm: I think that it could be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/886842 :?
[11:05:19] looking
[11:07:58] Also looking. Sorry for that.
[11:08:36] I'll wait to merge it till the puppetmaster errors are addressed
[11:09:32] I'm gonna need help for that
[11:10:07] claime: did you force-push the wrong SHA1 earlier?
[11:11:06] I didn't, because I saw jynus had merged something and I figured it would straighten things out
[11:12:19] Basically my connection dropped during puppet-merge
[11:12:36] https://phabricator.wikimedia.org/P43611
[11:12:40] That's my log
[11:13:07] Then I tried to run merge.py --ops on the puppetmasters, but that failed because I hadn't specified --yes
[11:13:27] Then I saw j.ynus had merged something new, so I didn't run it
[11:14:50] Should I connect to the alerting hosts and run puppet-merge there?
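For context on the interrupted puppet-merge above, a minimal sketch of how one might check what state it left things in before re-running anything. The log location and the on-disk repo path are assumptions, not confirmed in the conversation; only the "puppet-merge.py: (puppet) Merging: old -> new" line format is taken from the paste further down.

    # on the puppet-merge host, look at the merge records around the failed run
    # (log location is an assumption; the line format matches the quote below)
    grep 'puppet-merge' /var/log/syslog | tail -n 20
    # on each frontend, confirm the checked-out HEAD is the SHA1 you expect
    # (repo path /var/lib/git/operations/puppet is an assumption)
    sudo git -C /var/lib/git/operations/puppet rev-parse HEAD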
[11:14:54] claime: what time was your run
[11:16:18] ahh sorry, let me go check... /me answering my own question: 10:00:04
[11:17:09] I need to put the datetime back in my prompt because I had no idea
[11:17:33] I +2'd my change at 10:22:35 GMT
[11:18:11] ahh yes you are right
[11:18:11] 10:23:02 puppetmaster1001 /puppet-merge.py: (puppet) Merging: a697efe12b913ef15bb71a8dc0eb59f0cfaeefa6 -> e1047aa3e6898bba6f1de75cb4ba53cd5a57f456
[11:22:51] jbond: FYI let me know how it goes / if help is needed (clinic duty)
[11:23:37] godog: thanks but i don't think there is a problem, i just merged a change and it all went fine
[11:23:57] claime: i suspect that jyn.us fixed things when he did his merge
[11:24:28] oh ok, yeah I'm expecting a subsequent puppet-merge to make things right again
[11:24:30] the error about puppetmaster[12]002 is a red herring. those two nodes have been marked for decommission
[11:24:58] ah, got it, thank you, that explains it
[11:25:02] Aaaaah
[11:25:28] Can we downtime them if they're to be decommed?
[11:26:04] claime: yes of course, i think something probably expired on them
[11:26:08] i'll decom them today
[11:26:17] jbond: ok, downtiming for 24h
[11:26:28] ack
[11:28:23] jbond: done
[11:29:39] I think vgutierrez can proceed right?
[11:30:22] yes vgutierrez you can proceed
[11:31:07] also, going back on what i said, i will not decom them, i'll put them back in service. i remembered we are leaving these around for a bit to make it simpler to migrate to puppet 7 in the future https://phabricator.wikimedia.org/T314136#8589138
[11:32:42] Do your thang :p
[11:38:28] does anyone know where alertmanager sources the team tag from? I think I spotted some mistakes (although I won't change anything without consulting people first) and want to make sure where the issue comes from
[11:39:21] jynus: I think it's from the file structure, but I'm not sure
[11:40:02] Huh, not even
[11:40:03] I think there were some owner-tagging efforts in hiera, but not sure if related
[11:40:08] There's a team: sre tag
[11:40:29] ❯ git grep -c 'team: sre'
[11:40:31] team-netops/netops.yaml:5
[11:40:35] in the alerts repo
[11:40:46] So it looks like it's defined per alert
[11:41:28] jynus: Here's where the team routes are defined, if that helps: https://github.com/wikimedia/operations-puppet/blob/production/modules/alertmanager/templates/alertmanager.yml.erb#L49
[11:43:37] vgutierrez: Did your patch fix reprepro as expected?
[11:46:12] Amir1: marostegui: Are there read-only alerts for mysql defined in alertmanager? I grepped but didn't find one, I'd like confirmation before merging https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/718936
[11:46:22] claime: Not that I know of
[11:46:51] marostegui: Thanks :)
[11:53:04] I think I will ask someone from observability later, because the issue could be the subtleties of what the tags mean, not the code
[11:55:45] I haven't migrated any alerts myself
[11:56:12] And I don't think Am1r did any either
[11:56:34] yeah, the only thing I can find is a general exporter test
[11:56:55] Anyways we'll find out, won't we :p
[12:00:35] I am getting this error when running puppet: https://phabricator.wikimedia.org/P43632
[12:04:18] jynus: works for me, I think it just hit puppetdb while its API was restarting
[12:04:37] jynus: running 'run-puppet-agent' on dbprov2002 right?
[12:05:10] ok, thanks, I thought it was something more permanent
[12:26:43] godog: would it be possible to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/881602? I've tested it in toolsbeta and it works great
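For the team-tag question above, a small sketch of where to look, assuming local clones of the two repos mentioned in the conversation (operations/alerts for the per-rule team labels, operations/puppet for the routing template):

    # per-alert team: labels live in the alert rule files (operations/alerts)
    git -C operations/alerts grep -n 'team:' | head
    # the routes mapping a team label to a notification channel live in the
    # alertmanager template in operations/puppet
    git -C operations/puppet grep -n 'team' modules/alertmanager/templates/alertmanager.yml.erb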
[12:28:36] jbond | claime | godog: Not sure if that's related to the broken puppet merge earlier, but I caught puppet applying an old version of a file on kubestagemaster1001 at 11:51:04
[12:29:48] ultimately revoking access to k8s apiserver metrics for the prometheus user. It was fixed with the next (manual) puppet run
[12:30:20] I wonder if it's because of a labs merge that wasn't done correctly or what
[12:31:09] I've no idea but it looked veeery strange and made me wonder if there were other side effects as well
[12:31:52] <_joe_> labs merges shouldn't count
[12:32:05] <_joe_> has anyone checked the status of the git repos on every puppetmaster?
[12:32:11] jayme: I got this error running puppet on puppetmaster earlier
[12:32:13] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /etc/puppet/modules/profile/manifests/kubernetes/kubeconfig/admin.pp, line: 12, column: 32) on node cumin1001.eqiad.wmnet
[12:32:14] because the change that got kinda rolled back was from 2023-01-24
[12:32:23] Resolved with another run
[12:32:41] <_joe_> this means there's a host with outdated puppet, possibly
[12:34:16] I haven't checked anything else yet tbh
[12:34:34] <_joe_> it looks like everything is ok now
[12:35:06] maybe kubestagemaster1001's puppet run was just bad timing then
[12:36:21] puppetmaster[1,2]002 have very old versions of the private repo, but should not be used since jbond declared them offline
[12:37:20] kubestagemaster2001 applied that (old) change now as well
[12:37:29] it was okay a couple of minutes ago
[12:38:37] 12:15:22 to be fair
[12:38:57] ok wth
[12:39:54] also fixed with the next puppet run
[12:42:44] Wait a minute
[12:42:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/886867/ puts them back in
[12:43:09] but their private repo is WAY out of date
[12:43:52] it's not anymore though
[12:43:59] it was a few minutes ago
[12:45:36] maybe it just required some time/a puppet merge after re-enabling them to update the private repo?
[12:45:52] Possibly yeah
[12:46:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ae5624f5ced044796847552386d91f7813564263%5E%21/#F0 they were re-enabled for the puppet7 migration
[12:48:56] yeah but I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/886867/ is responsible
[12:49:11] Or maybe not
[12:55:40] <_joe_> jbond: ^^
[12:56:05] <_joe_> claime, jayme: check which puppetmaster compiled the catalog for kubestagemaster2001 then
[12:56:24] <_joe_> grepping syslog on the puppetmasters should give you that
[12:59:19] _joe_: jayme: claime: sorry this was my mistake, i did not sync the private repo after bringing the 2 puppetmasters back
[12:59:23] i have synced them now
[12:59:30] <_joe_> ack :)
[12:59:40] is there any fallout i can help with?
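A sketch of the kind of cumin run that would produce the output pasted just below, following _joe_'s hint about grepping syslog on the puppetmasters; the host selector and the exact grep pattern are assumptions inferred from the truncated command in the paste:

    # run from a cumin host; 'puppetmaster*' is an assumed selector, the grep
    # pattern is inferred from the truncated header in the pasted output
    sudo cumin 'puppetmaster*' \
      'grep "Compiled catalog for kubestagemaster2001.codfw.wmnet" /var/log/syslog'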
[12:59:58] (1) puppetmaster2002.codfw.wmnet
[13:00:00] ----- OUTPUT of 'grep "Compiled c.../var/log/syslog' -----
[13:00:02] Feb 6 12:15:18 puppetmaster2002 puppet-master[6644]: Compiled catalog for kubestagemaster2001.codfw.wmnet in environment production in 7.56 seconds
[13:00:04] Yeah that's exactly it
[13:00:17] When you brought them back, they went and got their catalog from 2002
[13:00:20] Which was outdated
[13:00:23] <_joe_> i am worried about certs
[13:00:37] <_joe_> claime: run puppet on the deployment servers, will you
[13:00:43] claime: was the catalog outdated or just the private repo, the former should have been in sync
[13:01:07] s/catalog/main production repo/
[13:01:13] jbond: I think the catalog was compiled with outdated info from the private repo
[13:01:18] <_joe_> jbond: tbh i'd force-run puppet on any host which got compiled there during that period
[13:01:20] And yes, that too
[13:01:21] ack that makes sense
[13:01:26] _joe_: doing
[13:01:38] jobo: ack, i'll check the logs on the masters and do that now
[13:01:42] starting with codfw, obvs
[13:03:25] _joe_: Just some ownership changes on deploy2002.codfw
[13:03:58] (the usual trebuchet/helm corrective crap)
[13:05:22] looks good on deploy1002, no change
[13:05:44] 13:00:45 +jinxer-wm │ (JobUnavailable) firing: (2) Reduced availability for job k8s-api in k8s-staging@codfw - < jayme: think that's related to what you were saying earlier wrt kubestagemaster2001?
[13:06:02] yes, that's what made me start looking
[13:06:22] well, the kubestagemaster1001 version of that alert
[13:14:33] I re-ran puppet once again on kubestagemaster1001 as that one got the old version again shortly before j.bond did the sync
[13:14:54] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3Dpuppet%20last%20run < Mpf
[13:16:37] jbond: are you re-running puppet rn?
[13:17:18] yes
[13:17:25] ok
[13:20:51] taavi: yes, I can't today but I'll take a look later in the week
[13:23:20] claime: fyi i am stopping my manual puppet run as all hosts will complete their normal cycle in the next 6 minutes, which is quicker than the cumin command will finish
[13:23:44] jbond: ack
[13:23:48] but everything should be back to the correct state by 13:30
[13:32:38] jbond: is it ok to use puppet-merge right now?
[13:32:50] vgutierrez: yes, should be fine
[13:33:00] ack
[13:33:05] I'm merging Jbond: Revert "add whitespace to test puppet-merge" (f2d9f2c9b2) along with mine
[13:33:15] should be pretty innocuous
[13:33:17] vgutierrez: thanks
[13:33:17] (famous last words)
[13:33:25] yes it is just whitespace in the README
[13:33:45] done... all good
[13:33:49] cheers
[13:37:28] hmm this is "new"
[13:37:30] vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 checkupdate buster-wikimedia
[13:37:30] Nothing to do found. (Use --noskipold to force processing)
[13:37:43] moritzm: ^^
[13:38:16] moritzm: maybe it's related to the layer8 here...
[13:40:01] I'll have a look in ~5m
[13:55:43] the updates are visible if you pass --noskipold, it flags a pending update to 2.4.21-1~bpo10+1, though (no idea why this isn't the default...)
[13:56:24] yep.. I've noticed that
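Putting the --noskipold exchange above together, a sketch of the sequence on the apt host: the checkupdate command and the --noskipold flag are taken verbatim from the conversation, while the follow-up update invocation is an assumption about how the pending 2.4.21-1~bpo10+1 package would then actually be pulled in.

    # re-check, forcing reprepro to reprocess index files it has already seen
    sudo -i reprepro --noskipold --component thirdparty/haproxy24 checkupdate buster-wikimedia
    # if the pending update looks right, pull it in (flags assumed to mirror the check)
    sudo -i reprepro --noskipold --component thirdparty/haproxy24 update buster-wikimedia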
[14:11:42] Hello, I'd like to welcome nfraison (https://phabricator.wikimedia.org/p/nfraison/) to the team. He's a new Senior SRE on the Data Engineering team, and he'll be working alongside Steve and me. I've invited him along to the SRE meeting later today and I hope to be there too.
[14:12:14] welcome aboard nfraison!
[14:12:35] welcome nfraison !
[14:13:11] welcome!
[14:15:13] welcome nfraison
[14:16:53] Hi folks
[14:22:06] welcome!
[14:24:42] welcome nfraison!
[14:25:28] welcome nfraison
[14:26:32] welcome nfraison !
[14:28:55] welcome!
[14:34:21] Welcome nfraison !
[14:35:11] Welcome nfraison :)
[14:57:58] welcome nfraison!