[02:23:48] Hi all i am reorganizing our Phab Board to a new tag, apologies for the phab bot spam
[02:25:02] *phabricator
[07:14:37] lmata: FYI, in future you can mute notifications too
[08:06:40] Folks I've revised the proposed schedule for the planned maintenances on switch stacks in eqiad, reversing the order we do them in effectively (so row D first).
[08:07:29] This is to accommodate some database migration work that needs to happen in advance of row A/B being touched.
[08:07:49] Schedule has been updated on https://phabricator.wikimedia.org/T284592
[08:08:26] We will obviously try to work with everyone to ensure the timetable is ok, so please do advise if this is too soon etc.
[08:08:27] thanks
[08:17:11] topranks: hi! Thanks a lot for all the precise work and schedule, it really helps :)
[08:18:39] I have to say that I wouldn't change everything for marostegui and kormat (kidding, <3 <3)
[08:19:09] ಠ_ಠ
[08:20:47] haha
[08:28:31] topranks: I like the new schedule better actually, it means I can probably avoid potential impact altogether on "my" stuff!
[08:28:58] so as far as I'm concerned, it's being changed for me, not for those dba types :-P :-D
[08:29:10] ah great! the worry always with such a change is you make it even worse for everyone else.
[08:29:26] I don't know about other service owners but yeah big +1 from me :-)
[08:29:27] and of course, it was all for you, we wouldn't lift a finger for those pesky dba's :)
[08:29:31] lololol
[08:29:35] happy monday everybody!
[08:35:48] Morning all.
[08:57:27] apergos || jynus: we do lack a recording of the "WMF databases" onboarding chat and I'm not completely sure whom to ask to do it :)
[08:59:01] ema: would you be so kind to place the slides for "The CDN" into the onboarding chats folder?
[08:59:19] jayme: will do!
[08:59:51] ema: <3 (I just renamed the video to "15 - The CDN")
[09:00:35] great title, I approve
[09:00:36] I think it might make more sense to have it after "Networkd, ins and outs", though
[09:01:02] eheh, the title is your creation I suppose
[09:06:57] jayme: um. I've never given the slide dick for the dbs that I did (or fixed them up after some valuable comments from the dbas)... and they would need to be updated
[09:07:13] should I do that? does someone else want to do a better presentation?
[09:07:38] * apergos looks at marostegui
[09:08:11] jayme: uploaded as 15-The.CDN.pdf
[09:08:19] thanks ema
[09:14:54] *slide deck!! ugh. mondays
[09:15:46] just discovered jbond's puppet presentation, great stuff for a Monday!
[09:18:26] :D slow monday ema?
[09:18:36] got my old broken laptop to boot again.. after the brandnew one to replace it .. broke.. heh
[09:19:11] jbond: I'm learning lots of stuff really! :)
[09:19:34] especially about puppetdb
[09:20:43] :D good good, ping if you have any follow ups. fyi for puppetdb stuff you should check out puppetdb_query
[09:22:09] i think its only mentioned once on https://puppet.com/docs/puppetdb/5.2/api/query/tutorial.html but it allows you to use the full puppet api syntax including pql (https://puppet.com/docs/puppetdb/latest/api/query/v4/pql.html) from puppet (so a bit more flexible than the stuff in 'puppetdbquery' (the module))
[11:41:10] ok, maybe someone can tell me what i'm missing here. i've added a new (well, renamed) host to puppet, but it's not showing up on icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1183
[11:41:23] the host is up, puppet has run on it and on alert1001
[11:57:32] kormat: is the new host being listed in the icinga config?
[11:57:45] elukey: how does one check?
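[Editor's note: a minimal sketch of the puppetdb_query function mentioned above, called with PQL from a Puppet manifest. The fact path and variable names are illustrative, not taken from the log.]

```puppet
# Sketch only: query PuppetDB for the certnames of all Debian hosts
# using PQL against the inventory endpoint.
$rows = puppetdb_query('inventory[certname] { facts.os.family = "Debian" }')

# puppetdb_query returns an array of hashes, one per matching row.
$debian_hosts = $rows.map |$row| { $row['certname'] }
```

Unlike the query functions in the third-party 'puppetdbquery' module, this built-in accepts any query the PuppetDB v4 API understands, including full PQL.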
[11:58:36] it's in /etc/icinga/objects/puppet_hosts.cfg, if that's the right thing
[11:58:44] yeah I checked it, looks good
[12:00:03] there's also a critical for invalid icinga config from four days ago, that's likely it
[12:02:20] ah! yes definitely, I missed it
[12:03:07] Error: Could not find any hostgroup matching 'pki_codfw' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 46603)
[12:03:25] yep
[12:04:27] gah, yeah that's a papercut alright
[12:04:28] i've no idea what a hostgroup is in icinga. the host in question is pki2001
[12:05:44] yeah it is part of monitoring::groups in this case, I'll fix it
[12:05:49] godog: thanks <3
[12:07:33] sure! fixage at https://gerrit.wikimedia.org/r/c/operations/puppet/+/704101
[12:08:29] at least CI should fail on that condition or even better autogenerate parts of monitoring::groups
[12:08:44] I'll file a task #papercut
[12:19:12] db1183 now shows up in the icinga web "u/i" \o/
[12:28:21] godog++
[12:30:23] \o/
[12:36:39] please do not run the sre.dns.netbox cookbook for the next ~20 minutes, I noticed there is some wrong data in Netbox I'm about to fix
[12:39:11] * elukey runs the cookbook
[12:39:18] * kormat drops all data from netbox
[12:40:21] bunch of bloody troublemakers
[12:51:45] ok fixed, you can resume normal operations, thx
[14:08:13] sorry about that godog, thanks for the fix <3
[14:14:44] jbond: np! it is a silly thing to have to keep the two lists in sync
[14:16:34] ack thanks will try to remember for the future :)
[14:17:04] jbond: hi! any guesses why cfssl is failing on deployment-parsoid12? T286375
[14:17:05] T286375: Puppet failing on deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud due to cfssl signing failure - https://phabricator.wikimedia.org/T286375
[14:17:59] majavah: possibly, i did some upgrades a couple of weeks ago, give me five mins and ill take a look
[14:18:13] cool, thank you!
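[Editor's note: for readers unfamiliar with Icinga hostgroups (the error around 12:03 above), here is a hedged sketch of the relevant Icinga 1.x object syntax. The object names are taken from the error message; all other directives are illustrative.]

```
# A host can declare membership in one or more hostgroups:
define host {
    host_name   pki2001
    hostgroups  pki_codfw
    # ... other directives (address, check_command, etc.) omitted
}

# Icinga refuses to load its config unless every referenced
# hostgroup is itself defined somewhere, e.g.:
define hostgroup {
    hostgroup_name  pki_codfw
    alias           pki_codfw
}
```

Hence the sync problem discussed above: the Puppet-generated host definitions and the hostgroup list (monitoring::groups) must be kept consistent, or the whole Icinga config fails validation.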
[14:39:42] majavah: looking on deployment-puppetmaster04 the sync job is failing with merge conflicts. specifically for this issue it is missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/703580
[14:42:16] topranks: o/ if you have a moment later on https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104
[14:42:43] no probs, will have a look shortly
[14:45:34] jbond: good catch, thanks. fixed that, let's see if it works now
[14:47:52] majavah: puppet runs look good now
[15:15:24] elukey: That CR looks good to me. Are the ctrl nodes already deployed and ready to go?
[15:15:41] If so we can merge any time and push the changes with homer, happy for you to do that or I can take care of it.
[15:22:22] topranks: yep all ready! I can take care of homer if you are ok, just wanted to get a +1 and make sure that it was a good time
[15:24:04] Yeah now is fine, or any time really so fire away.
[15:24:30] The diff should be fairly self-explanatory when you run homer, adding the new IPs to group "Kubemlserve[4|6]"
[15:24:49] any concerns just grab me here.
[15:26:42] yep the diff looks very simple and easy, going to commit!
[15:27:04] thanks a lot!
[15:27:09] nice :)
[17:02:08] open for questions if you have any- first time I saw the process I was confused
[17:15:30] topranks: ema: I've created a draft at https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-06-15_Eqsin_network based on public/common information available to me. Feel free to edit/elaborate as you see fit.
[17:15:53] I believe the reason the restore took ~15min was DNS, is that right? I haven't mentioned it but let me know if that's right.
[17:16:11] I noticed the private doc still says "ongoing", not sure if that matters, but fyi :) - https://docs.google.com/document/d/1_rV0RU9wZ0Y1VQUJkOq5L2uDUv-7XgOCuJyR6o5f_BY/edit
[17:19:00] DNS TTL is actually 10 minutes:
[17:19:07] cathal@officepc:~$ dig +noall +answer dyna.wikimedia.org @ns0.wikimedia.org.
[17:19:07] dyna.wikimedia.org. 600 IN A 91.198.174.192
[17:19:49] I think I listed it as 15 in the timeline as I was looking at network traffic, and our polling of router bandwidth is only every few minutes, so it took that long to be fully confident we were at "normal" levels.
[17:20:16] But yes, DNS was the reason, so probably best to list that. Thanks!
[17:21:37] I'll have a closer look at the incident report
[17:34:59] Krinkle: That summary looks good to me thanks.