[00:01:37] for just that step, the VIPs in Netbox, no
[00:01:48] just run the dns netbox cookbook afterwards
[00:02:23] but in general just follow the order of the patches in general if you want less toil :)
[00:03:15] two in general but you get the general idea
[00:07:06] Cool. Yeah I'm trying to get all the patches stacked and ready so we can proceed efficiently when it's time
[00:08:47] I've added patches for most of the steps now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094074 I think I'm just missing a DNS patch maybe, and also will need to add the allocated IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094061/4 once I've clicked the button in netbox
[00:13:53] I'm (perhaps overly optimistically) hoping we can do pybal rolling restarts on tuesday the 26th, if somebody is around. I should have the patches fully ready by end of day friday so I'll have monday free to look over stuff and make sure everything looks good
[00:24:33] anyway I'm drafting an email with a more formal proposed plan and a whole lot more context, I'll fire that off in a few hours
[01:48:26] no problem, add me or brett for the reviews and Tuesday works for us
[01:48:56] add us both actually since brett will be busy with the magru work so I will take care of it
[02:17:18] great!
[03:08:49] fired off the email to `sre-traffic@`. maybe I should have made it a phab comment, the lack of `backticks` makes it not as easy on the eyes as I would like :P
[07:37:12] 06Traffic, 10ops-magru, 06SRE, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10347297 (10MoritzMuehlenhoff) How are we planning to handle removing the servers on our side? I think we should run the decom cookb...
[09:57:30] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10347431 (10cmooney) This port bounced again overnight: ` cmooney@cloudsw1-d5-eqiad> show log messages....
[09:58:30] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 4 others: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10347436 (10akosiaris) Cool thanks, I'll take over this one.
[10:06:17] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570 (10jcrespo) 03NEW
[10:06:27] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10347487 (10jcrespo) p:05Triage→03Unbreak!
[10:08:11] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10347493 (10jcrespo)
[10:44:05] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10347604 (10jcrespo) p:05Unbreak!→03High ` [11:37:05] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wi...
[10:44:36] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10347606 (10jcrespo) a:03Fabfur
[10:47:32] fabfur: vgutierrez reached out about haproxykafka not being able to produce to kafka after the ACL change.
I think we need to revert
[10:47:46] fabfur is OoO today
[10:47:48] (related to T380373)
[10:47:49] T380373: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373
[10:47:50] I'm taking care of that
[10:47:55] gotcha
[10:47:59] so let him enjoy his day off :)
[10:48:17] do you need us to drop the ACLs?
[10:48:19] I've disabled haproxykafka temporarily
[10:48:48] we can drop the ACLs or fix the TLS config on haproxykafka itself
[10:49:17] would you mind posting your findings on T380373?
[10:51:48] you mentioned "fix the TLS config". Was the wrong CN configured in the ACLs?
[10:52:56] (in the meantime, I can drop the ACLs if we're in incident mode)
[10:53:28] 06Traffic, 06Data-Platform-SRE: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373#10347620 (10Vgutierrez) This triggered errors on every haproxykafka instance after losing producer access to the configured topics: ` Nov 21 17:42:40 cp5031 haproxykafka[3825906...
[10:53:35] vgutierrez: if it can help, this is what kafka jumbo sees
[10:53:36] Principal = User:ANONYMOUS is Denied Operation = Write from host = 10.132.0.24 on resource = Topic:webrequest_frontend_text (kafka.authorizer.logger)
[10:54:10] already added to the task sorry, didn't see
[10:54:37] I didn't realize that mtls wasn't enabled everywhere
[10:54:45] yeah, problem has been identified
[10:54:46] thx elukey
[10:54:48] I thought it was just a localized test
[10:54:55] this is why I +1ed yesterday
[10:55:00] should've asked :(
[10:55:42] as brouberol said we can also remove those rules if you want to keep the topics alive for testing etc..
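The rollback being discussed here, dropping the ACLs added for T380373, amounts to re-running the original grant commands with the add flag swapped for remove. A minimal sketch, assuming the grants were made with a kafka-acls invocation roughly like the one below (the principal `User:haproxykafka`, the `--producer` shorthand, and the omitted broker/auth options are assumptions; only the topic name comes from the authorizer denial quoted above):

```shell
# Hypothetical original grant (bootstrap-server and auth options omitted)
ADD='kafka-acls --add --allow-principal User:haproxykafka --producer --topic webrequest_frontend_text'

# The revert is literally s/--add/--remove/g on the same invocation
REMOVE=$(printf '%s\n' "$ADD" | sed 's/--add/--remove/g')
printf '%s\n' "$REMOVE"
# → kafka-acls --remove --allow-principal User:haproxykafka --producer --topic webrequest_frontend_text
```

In practice each `--add` command from the original change would be replayed with `--remove` against the same bootstrap server.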
[10:56:00] it is basically just s/--add/--remove/g IIRC
[11:09:24] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10347675 (10jcrespo) I've freed ~500GB of logs on both centrallog1002 and centrallog2002 by deleting 24+ hours of cp* hosts containing the errors on the header. cp* local logs have not been touched, so...
[11:18:16] elukey: ` profile::pki::get_cert('kafka', 'haproxykafka'` requires some kind of configuration on the pki side?
[11:18:50] just to be sure: are y'all pushing through to get the mTLS setup deployed, or should I drop the ACLs?
[11:19:11] brennen: I'll get back to you if we need to drop those ACLs, thx
[11:19:27] oops.. brouberol :)
[11:20:03] alright. Like e.lukey said, it's s/--add/--delete/ for each command, in case I'm not around
[11:20:30] err, no. s/--add/--remove
[11:23:45] ok.. I've found the config :)
[11:25:09] elukey: so, is it feasible to use another set of certs here or do we need to use the kafka one?
[11:27:05] I guess that the question really is which CA is trusted by kafka when performing mTLS
[11:34:54] vgutierrez: for PKI I don't think a config is needed, we can use the same as varnishkafka (I don't recall if we use the Kafka intermediate or not)
[11:35:28] vgutierrez: what do you mean by "another set of certs"?
[11:35:31] I'm trying to find a puppet.log that shows the puppet errors that suk.he mentioned yesterday for reverting
[11:35:45] elukey: nothing, i've already seen we are using the kafka ones
[11:35:53] we are only setting "haproxykafka" as the CN
[11:36:00] so I'm guessing it's the same CA
[11:36:49] yep in profile::cache::kafka::certificate I see that we use the kafka profile, so we are using the kafka intermediate
[11:37:19] the CN is always the same because IIRC we cannot have wildcards in kafka acls etc..
[11:37:37] so once puppet provisions the cert on the cp node, then haproxykafka should be good to go
[11:40:11] yeah...
I was checking cp4048 and puppet yesterday worked there as expected
[11:40:37] https://www.irccloud.com/pastebin/btorQ6Rz/
[11:40:51] * elukey nods
[11:43:10] LOL of course
[11:43:33] profile::pki::get_cert() gets called unconditionally and it requires User[$haproxykafka_user]
[11:43:54] so that's a puppet failure everywhere where haproxykafka isn't deployed
[11:50:20] so I'll add an ensure => absent test to the _spec
[11:50:23] and fix that
[11:54:25] the test doesn't fail... cause of course the dependency is still met, User[$haproxykafka_user] is actually in the catalog when absented
[11:54:48] but of course cfssl will crash at execution time cause the haproxykafka user isn't actually there
[11:55:04] and the directory structure will be missing as well
[12:09:56] elukey, sukhe https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094384 should now be ready for your review
[12:10:10] or not :)
[12:11:23] oh.. uncommitted change on the _spec.rb file :)
[12:18:03] vgutierrez: just to double check, does rdkafka need an explicit "ssl: on" flag or can we skip it?
[12:18:35] elukey: it doesn't need it, it was already working as expected, and 9093 AFAIK is SSL only
[12:19:45] LGTM
[12:19:58] I was about to go afk for lunch, do you want me to stay for the rollout?
[12:19:59] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583 (10Vgutierrez) 03NEW
[12:20:06] in case some kafka horror to debug comes up?
[12:20:21] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10347899 (10Vgutierrez) p:05Triage→03High
[12:20:58] 06Traffic: haproxykafka features - https://phabricator.wikimedia.org/T374128#10347900 (10Vgutierrez)
[12:21:35] elukey: nope, let's do it after lunch
[12:21:40] ack!
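On the CN question discussed above: with mTLS, Kafka derives the ACL principal from the subject CN of the client certificate, which is why a fixed CN like "haproxykafka" has to line up with the `User:` entries in the ACLs. A self-contained sketch with a throwaway self-signed cert (the file paths and cert parameters are illustrative, not the production PKI setup):

```shell
# Generate a throwaway key + self-signed cert with CN=haproxykafka
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/hk.key -out /tmp/hk.crt -subj "/CN=haproxykafka" 2>/dev/null

# Print the subject; the CN here is what Kafka's authorizer matches
# against User:<CN> ACL entries (output formatting varies by openssl version)
openssl x509 -in /tmp/hk.crt -noout -subject
```

The production certs differ only in that they are issued by the Kafka intermediate CA (via `profile::pki::get_cert`) rather than self-signed, so the brokers can actually verify them.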
[12:21:43] I'm home alone and I need to take care of the dogs here
[12:21:46] or they will eat me
[12:21:52] fair enough LD
[12:21:54] :D
[12:22:19] https://usercontent.irccloud-cdn.com/file/ezMDovZ3/1000036669.jpg
[12:22:25] she is a proper beast
[12:24:16] elukey: hmm the if guard isn't working as expected
[12:24:49] see PCC output for cp3067 where haproxykafka ensure is set to absent: https://puppet-compiler.wmflabs.org/output/1094392/2478/cp3067.esams.wmnet/index.html
[12:28:31] where is it set to absent?
[12:28:55] hieradata/common/profile/cache/haproxykafka.yaml:profile::cache::haproxykafka::ensure: absent
[12:29:44] so stdlib::ensure($ensure) is evaluating to true somehow and it doesn't make a lot of sense
[12:30:22] stdlib::ensure returns a $ensure.bool2str('present', 'absent') when the optional parameter $resource is undef
[12:30:45] I was reading the ensure.pp file yes
[12:30:50] oh nope..
[12:30:54] it doesn't do that
[12:31:02] it translates a boolean to 'present'/'absent'
[12:31:12] bool2str, not str2bool lol
[12:31:27] * vgutierrez brain farting since 1986
[12:31:52] didn't really catch it either :)
[12:32:03] * elukey runs to lunch
[13:49:08] 06Traffic: centrallog1002, centrallog2002 running out of disk space - https://phabricator.wikimedia.org/T380570#10348106 (10jcrespo) 05Open→03Resolved a:05Fabfur→03Vgutierrez Resolved, it looks like it is being handled by #traffic at T380583.
[14:31:52] 06Traffic, 10ops-magru, 06SRE, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10348245 (10RobH) We were discussing this last week, and brainstormed some on https://etherpad.wikimedia.org/p/magru_server_swaps fr...
[14:54:35] 06Traffic, 10ops-magru, 06SRE, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10348362 (10MoritzMuehlenhoff) Ack, thanks.
Either is fine with me, I can also switch them to insetup and then keep them running.
[16:12:21] 06Traffic, 06Data-Platform: GeoDNS: Pipeline from event.development_network_probe to operations/dns.git - https://phabricator.wikimedia.org/T380626#10348991 (10CDobbins)
[16:18:29] 10netops, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10349021 (10JMeybohm) This did bite us again and we had to {T380473} in a hurry. Quick fix to free up IPAM blocks without deco...
[16:29:34] 10netops, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10349049 (10JMeybohm) >>! In T375845#10246786, @akosiaris wrote: > Now, for the actual change. The biggest issue is coordinati...
[16:35:33] 06Traffic: GeoDNS: consider sending CN to eqsin - https://phabricator.wikimedia.org/T378744#10349068 (10ssingh) 05Open→03Resolved a:03ssingh
[17:23:23] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349255 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6b283bec-74b8-4f8c-9a46-f9...
[17:33:54] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349290 (10VRiley-WMF) Replaced the transceiver in cloudsw1-e4-eqiad et-0/0/54. Will test to see if th...
[18:00:01] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10349375 (10cmooney) Thanks @VRiley-WMF!
Seems ok so far but we can make a call Monday based on if we...
[19:00:40] 06Traffic, 07Browser-Support-Apple-Safari, 07Browser-Support-Firefox, 07Browser-Support-Google-Chrome, 07User-notice: Discovery: Deprecation of TLS 1.2 - https://phabricator.wikimedia.org/T367821#10349552 (10Xeverything11) TLS 1.3 is also available in Safari 12.1 and later. Full support in macOS Mojave a...
[19:04:42] 06Traffic, 07Browser-Support-Apple-Safari, 07Browser-Support-Firefox, 07Browser-Support-Google-Chrome, 07User-notice: Discovery: Deprecation of TLS 1.2 - https://phabricator.wikimedia.org/T367821#10349572 (10Xeverything11) Let's give an update to [[https://caniuse.com/usage-table | market share]]: * Chrom...
[19:12:19] 06Traffic, 07Browser-Support-Apple-Safari, 07Browser-Support-Firefox, 07Browser-Support-Google-Chrome, 07User-notice: Discovery: Deprecation of TLS 1.2 - https://phabricator.wikimedia.org/T367821#10349604 (10Xeverything11) Let's give an update to [[https://caniuse.com/usage-table | market share]]: * Chrom...
[20:01:08] 06Traffic, 07Browser-Support-Apple-Safari, 07Browser-Support-Firefox, 07Browser-Support-Google-Chrome, 07User-notice: Discovery: Deprecation of TLS 1.2 - https://phabricator.wikimedia.org/T367821#10349758 (10Izno) Usually better to use our own stats. - [[https://steve-adder.toolforge.org/?wanted=Edge&wa...
[21:39:05] 06Traffic: Improve geo-maps file - https://phabricator.wikimedia.org/T380651 (10CDobbins) 03NEW