[02:00:33] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with OS b...
[02:12:48] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bulls...
[02:21:56] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[02:22:38] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) Stalled→Resolved This is complete
[09:10:22] vgutierrez, fabfur o/ I'd like to depool cp4037 and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/924509 if you are ok
[09:18:55] elukey: ok for me, vgutierrez will be available in a few hours
[09:21:10] fabfur: ack! I can probably wait, I am not 100% sure about the depool command. I'd just use `service=cdn,name=cp4037.ulsfo.wmnet`, apply the change check logs etc..
[09:21:50] you do use a cookbook for that?
[09:31:13] nope I use confctl usually
[09:31:28] like
[09:31:29] elukey@puppetmaster1001:~$ sudo -i confctl --quiet select 'name=cp4037.ulsfo.wmnet' get
[09:31:32] {"cp4037.ulsfo.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
[09:31:35] {"cp4037.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_text,service=cdn"}
[09:31:38] do we have a cookbook for it?
[09:31:53] if so I didn't use it before
[09:32:47] mmm let me check
[09:33:22] 'cause I'm not 100% sure about the depool command too (I've only used cookbooks)
[09:33:40] so maybe is safer to wait for someone more experienced than me with this
[10:15:32] no cookbook for that, just the usual confctl depool
[10:21:49] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye
[10:53:50] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye completed:...
[13:23:13] sukhe: o/ is it ok if I depool cp4037? I assume that I can depool both "cdn" and "ats-be" services right?
[13:23:23] or is there a preferred one?
[13:23:46] cdn is haproxy + varnish, ats-be is just ats
[13:23:54] I think probably best to do both in your case so yep
[13:24:21] in theory I'd need only up to the varnish-fe
[13:24:40] but IIRC the ats-be is not shared anymore between multiple nodes right?
[13:24:44] or am I misremembering?
[13:25:04] single-backend is only in ulsfo and eqsin, so cp4037 would be that yep
[13:25:10] ahhhh okok
[13:25:46] elukey: I had say depool both and go for it
[13:25:58] I am around from traffic if that matters
[13:26:04] done!
[13:26:10] applying the vk change now
[13:26:17] ok! good luck
[13:29:09] ah wow interesting, user 'kafka' is not present on cp nodes
[13:29:27] should have thought about it
[13:30:11] all good :)
[13:30:38] elukey: want me to disable puppet on cp?
[13:30:38] it should be deployed on all nodes though, but I'll swap it to root
[13:30:50] nono it is fine, downtimed for the moment
[13:30:56] the change is only for 4037
[13:30:58] oh wait, you just did for cp4037
[13:31:00] right, ok great then
[13:31:06] not an issue at all sorry
[13:31:44] nono thanks for the support :)
[13:32:25] sukhe: created https://gerrit.wikimedia.org/r/c/operations/puppet/+/928846
[13:33:00] pcc is running but it should only be for 4037
[13:33:41] yep looks good :)
[13:33:56] looking
[13:35:53] thanks :)
[13:37:57] ok all good! But of course varnishkafka doesn't like the new cert, going to investigate
[13:40:22] hth if I can :)
[13:47:18] elukey: you are the expert
[13:47:22] but I think kafka.ssl.keystore.type needs to be set
[13:47:36] yeah I tried but it doesn't recognize it sigh
[13:47:38] in /etc/varnishkafka/webrequest.conf
[13:47:43] oh I see
[13:48:07] it maybe the version of librdkafka
[13:48:22] out of curiosity, what did you set it to?
[13:48:32] PKCS12
[13:48:36] when I tried
[13:48:45] yeah ok... matches up
[13:53:36] checked via openssl the keystore + pass, they look ok
[13:56:15] perhaps we also need to set kafka.ssl.truststore.type?
[13:56:23] sorry, all guess-work from my side
[13:56:43] was going by https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/19.0.x?topic=fcee-kafka-by-using-ssl-only-1
[13:56:57] good for a brainbounce, I am wondering if librdkafka supports it
[13:56:59] I mean our version
[14:06:24] going to revert the change, I think that a good way forward is probably to have a varnishkafka instance that pushes to kafka test
[14:06:27] and experiment on it
[14:06:36] ok
[14:08:42] (SystemdUnitFailed) firing: varnishkafka-webrequest.service Failed on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:38] the other worry that I have, after reading https://github.com/confluentinc/librdkafka/wiki/Using-SSL-with-librdkafka#configure-librdkafka-client, is that the keystore may not be used for client auth
[14:14:23] Traffic, Data-Engineering, Data-Platform-SRE, SRE: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) ` Jun 09 14:05:42 cp4037 varnishkafka[3568251]: %3|1686319542.526|FAIL|varnishkafka#producer-1| [thrd:ssl://kafka-jumbo1009.eqiad.wmnet:9093/bootstrap]: ssl://kafka-jum...
[14:14:52] cp4037 repooled
[14:14:58] thanks for the help!
[14:17:55] np! thanks for reverting
[14:18:03] and sorry for not being helpful :)
[14:18:42] (SystemdUnitFailed) resolved: varnishkafka-webrequest.service Failed on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:34] Traffic, Maps, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Elitre) >>! In T261694#8906865, @Galessandroni wrote: > Hi. In Vikidia (an European Wikipedia for kids) we have sev...
[19:04:55] Traffic, Observability-Metrics, Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (BCornwall) @herron I see there's an unresolved conversation in that patch. Since @Vgutierrez +1ed it before that conversation, I just want to make sure that it is, indeed,...
[19:15:11] Traffic, Observability-Metrics, Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (herron) >>! In T326657#8919271, @BCornwall wrote: > @herron I see there's an unresolved conversation in that patch. Since @Vgutierrez +1ed it before that conversation, I j...
[19:27:09] Domains, Traffic, DNS, SRE, Patch-For-Review: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (BCornwall) Open→In progress p:Triage→Low a:BCornwall
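
Note on the 09:31 confctl exchange: the `get` output quoted there is one JSON object per line, with the hostname as a key and the service carried inside the `tags` string. A minimal sketch of reading that output to see which services a host is pooled in could look like the following. The helper name `pooled_by_service` is hypothetical and not part of conftool; the sample text is copied from the two messages in the log, and the real output format may differ between conftool versions.

```python
import json

# Sample copied from the 09:31:32 and 09:31:35 messages in the log.
SAMPLE = '''\
{"cp4037.ulsfo.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
{"cp4037.ulsfo.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=ulsfo,cluster=cache_text,service=cdn"}
'''


def pooled_by_service(output, host):
    """Map each service tag to the host's pooled state.

    Hypothetical helper for illustration: it assumes one JSON object per
    line, shaped like the sample above.
    """
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if host not in record:
            continue  # line describes a different host
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(item.split("=", 1) for item in record["tags"].split(","))
        states[tags["service"]] = record[host]["pooled"]
    return states


if __name__ == "__main__":
    print(pooled_by_service(SAMPLE, "cp4037.ulsfo.wmnet"))
```

For the sample, both `ats-be` and `cdn` report `yes`, which matches the later discussion (13:23 to 13:26) about depooling both services for cp4037.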