[02:23:48] Hi all i am reorganizing our Phab Board to a new tag, apologies for the phab bot spam
[02:25:02] *phabricator
[07:14:37] lmata: FYI, in future you can mute notifications too
[08:06:40] Folks I've revised the proposed schedule for the planned maintenances on switch stacks in eqiad, reversing the order we do them in effectively (so row D first).
[08:07:29] This is to accommodate some database migration work that needs to happen in advance of row A/B being touched.
[08:07:49] Schedule has been updated on https://phabricator.wikimedia.org/T284592
[08:08:26] We will obviously try to work with everyone to ensure the timetable is ok, so please do advise if this is too soon etc.
[08:08:27] thanks
[08:17:11] topranks: hi! Thanks a lot for all the precise work and schedule, it really helps :)
[08:18:39] I have to say that I wouldn't change everything for marostegui and kormat (kidding, <3 <3)
[08:19:09] ಠ_ಠ
[08:20:47] haha
[08:28:31] topranks: I like the new schedule better actually, it means I can probably avoid potential impact altogether on "my" stuff!
[08:28:58] so as far as I'm concerned, it's being changed for me, not for those dba types :-P :-D
[08:29:10] ah great! the worry always with such a change is you make it even worse for everyone else.
[08:29:26] I don't know about other service owners but yeah big +1 from me :-)
[08:29:27] and of course, it was all for you, we wouldn't lift a finger for those pesky dba's :)
[08:29:31] lololol
[08:29:35] happy monday everybody!
[08:35:48] Morning all.
[08:57:27] apergos || jynus: we do lack a recording of the "WMF databases" onboarding chat and I'm not completely sure whom to ask to do it :)
[08:59:01] ema: would you be so kind to place the slides for "The CDN" into the onboarding chats folder?
[08:59:19] jayme: will do!
[08:59:51] ema: <3 (I just renamed the video to "15 - The CDN")
[09:00:35] great title, I approve
[09:00:36] I think it might make more sense to have it after "Networkd, ins and outs", though
[09:01:02] eheh, the title is your creation I suppose
[09:06:57] jayme: um. I've never given the slide dick for the dbs that I did (or fixed them up after some valuable comments from the dbas)... and they would need to be updated
[09:07:13] should I do that? does someone else want to do a better presentation?
[09:07:38] * apergos looks at marostegui
[09:08:11] jayme: uploaded as 15-The.CDN.pdf
[09:08:19] thanks ema
[09:14:54] *slide deck!! ugh. mondays
[09:15:46] just discovered jbond's puppet presentation, great stuff for a Monday!
[09:18:26] :D slow monday ema?
[09:18:36] got my old broken laptop to boot again.. after the brandnew one to replace it .. broke.. heh
[09:19:11] jbond: I'm learning lots of stuff really! :)
[09:19:34] especially about puppetdb
[09:20:43] :D good good, ping if you have any follow ups. fyi for puppetdb stuff you should check out puppetdb_query
[09:22:09] i think its only mentioned once on https://puppet.com/docs/puppetdb/5.2/api/query/tutorial.html but it allows you to use the full puppet api syntax including pql (https://puppet.com/docs/puppetdb/latest/api/query/v4/pql.html) from puppet (so a bit more flexible than the stuff in 'puppetdbquery' (the module))
[11:41:10] ok, maybe someone can tell me what i'm missing here. i've added a new (well, renamed) host to puppet, but it's not showing up on icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1183
[11:41:23] the host is up, puppet has run on it and on alert1001
[11:57:32] kormat: is the new host being listed in the icinga config?
[11:57:45] elukey: how does one check?
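[Editor's note: a minimal sketch of the puppetdb_query function mentioned above, called with PQL from a Puppet manifest. The fact path and variable names are illustrative, not taken from the log.]

```puppet
# Sketch only: query PuppetDB for the certnames of all Debian hosts
# using PQL against the inventory endpoint.
$rows = puppetdb_query('inventory[certname] { facts.os.family = "Debian" }')

# puppetdb_query returns an array of hashes, one per matching row.
$debian_hosts = $rows.map |$row| { $row['certname'] }
```

Unlike the query functions in the third-party 'puppetdbquery' module, this built-in accepts any query the PuppetDB v4 API understands, including full PQL.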
[11:58:36] it's in /etc/icinga/objects/puppet_hosts.cfg, if that's the right thing
[11:58:44] yeah I checked it, looks good
[12:00:03] there's also a critical for invalid icinga config from four days ago, that's likely it
[12:02:20] ah! yes definitely, I missed it
[12:03:07] Error: Could not find any hostgroup matching 'pki_codfw' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 46603)
[12:03:25] yep
[12:04:27] gah, yeah that's a papercut alright
[12:04:28] i've no idea what a hostgroup is in icinga. the host in question is pki2001
[12:05:44] yeah it is part of monitoring::groups in this case, I'll fix it
[12:05:49] godog: thanks <3
[12:07:33] sure! fixage at https://gerrit.wikimedia.org/r/c/operations/puppet/+/704101
[12:08:29] at least CI should fail on that condition or even better autogenerate parts of monitoring::groups
[12:08:44] I'll file a task #papercut
[12:19:12] db1183 now shows up in the icinga web "u/i" \o/
[12:28:21] godog++
[12:30:23] \o/
[12:36:39] please do not run the sre.dns.netbox cookbook for the next ~20 minutes, I noticed there is some wrong data in Netbox I'm about to fix
[12:39:11] * elukey runs the cookbook
[12:39:18] * kormat drops all data from netbox
[12:40:21] bunch of bloody troublemakers
[12:51:45] ok fixed, you can resume normal operations, thx
[14:08:13] sorry about that godog, thanks for the fix <3
[14:14:44] jbond: np! it is a silly thing to have to keep the two lists in sync
[14:16:34] ack thanks will try to remember for the future :)
[14:17:04] jbond: hi! any guesses why cfssl is failing on deployment-parsoid12? T286375
[14:17:05] T286375: Puppet failing on deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud due to cfssl signing failure - https://phabricator.wikimedia.org/T286375
[14:17:59] majavah: possibly, i did some upgrades a couple of weeks ago, give me five mins and ill take a look
[14:18:13] cool, thank you!
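[Editor's note: for readers unfamiliar with Icinga hostgroups (the error around 12:03 above), here is a hedged sketch of the relevant Icinga 1.x object syntax. The object names are taken from the error message; all other directives are illustrative.]

```
# A host can declare membership in one or more hostgroups:
define host {
    host_name   pki2001
    hostgroups  pki_codfw
    # ... other directives (address, check_command, etc.) omitted
}

# Icinga refuses to load its config unless every referenced
# hostgroup is itself defined somewhere, e.g.:
define hostgroup {
    hostgroup_name  pki_codfw
    alias           pki_codfw
}
```

Hence the sync problem discussed above: the Puppet-generated host definitions and the hostgroup list (monitoring::groups) must be kept consistent, or the whole Icinga config fails validation.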
[14:39:42] majavah: looking on deployment-puppetmaster04 the sync job is failing with merge conflicts. specifically for this issue it is missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/703580
[14:42:16] topranks: o/ if you have a moment later on https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104
[14:42:43] no probs, will have a look shortly
[14:45:34] jbond: good catch, thanks. fixed that, let's see if it works now
[14:47:52] majavah: puppet runs look good now
[15:15:24] elukey: That CR looks good to me. Are the ctrl nodes already deployed and ready to go?
[15:15:41] If so we can merge any time and push the changes with homer, happy for you to do that or I can take care of it.
[15:22:22] topranks: yep all ready! I can take care of homer if you are ok, just wanted to get a +1 and make sure that it was a good time
[15:24:04] Yeah now is fine, or any time really so fire away.
[15:24:30] The diff should be fairly self-explanatory when you run homer, adding the new IPs to group "Kubemlserve[4|6]"
[15:24:49] any concerns just grab me here.
[15:26:42] yep the diff looks very simple and easy, going to commit!
[15:27:04] thanks a lot!
[15:27:09] nice :)
[17:02:08] open for questions if you have any- first time I saw the process I was confused
[17:15:30] topranks: ema: I've created a draft at https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-06-15_Eqsin_network based on public/common information available to me. Feel free to edit/elaborate as you see fit.
[17:15:53] I believe the reason the restore took ~15min was DNS, is that right? I haven't mentioned it but let me know if that's right.
[17:16:11] I noticed the private doc still says "ongoing", not sure if that matters, but fyi :) - https://docs.google.com/document/d/1_rV0RU9wZ0Y1VQUJkOq5L2uDUv-7XgOCuJyR6o5f_BY/edit
[17:19:00] DNS TTL is actually 10 minutes:
[17:19:07] cathal@officepc:~$ dig +noall +answer dyna.wikimedia.org @ns0.wikimedia.org.
[17:19:07] dyna.wikimedia.org. 600 IN A 91.198.174.192
[17:19:49] I think I listed it as 15 in the timeline as I was looking at network traffic, and our polling of router bandwidth is only every few minutes, so it took that long to be fully confident we were at "normal" levels.
[17:20:16] But yes, DNS was the reason, so probably best to list that. Thanks!
[17:21:37] I'll have a closer look at the incident report
[17:34:59] Krinkle: That summary looks good to me thanks.