[07:17:44] jayme: o/ just seen your Friday message - I'll proceed with kubestage2004 this morning, and I guess you'll do 1003? (I just merged the patch that you sent for partman)
[07:18:24] we should probably send an email to ops@ to warn people about it, even if we do it early in the EU morning it should go unnoticed
[07:39:19] <_joe_> why would people notice?
[07:39:34] good morning :)
[07:39:40] if they deploy while we reimage
[07:39:47] (to staging)
[07:40:01] <_joe_> don't we still have a node up that can host all of our staging containers?
[07:40:27] yep yep, but if anything comes up people know what we are doing
[07:40:32] anyway, I am reimaging kubestage2002
[08:38:28] kubestage2002 on bullseye :)
[08:38:36] (and uncordoned)
[08:48:53] _joe_ if it is not a problem I can depool kubestage1003 and reimage it as well
[08:52:01] elukey: o/ I wasn't sure if we should do one node in eqiad first and then the other codfw one, but I think it does not really matter
[08:56:04] jayme: ack, I think that we can move the staging eqiad cluster to overlay / bullseye as well, and then think about one kubernetes* node
[08:56:31] +1
[08:56:48] I'll also take care of the ml-serve clusters
[08:57:04] jayme: ok to proceed with 1003 then?
[08:57:42] elukey: yes, sure!
[08:57:50] * elukey proceeds
[09:45:59] kubestage1003 up and running (uncordoned and taking traffic again)
[09:46:18] need to step afk for a bit, if all looks good I can also reimage 1004
[10:01:48] great, thanks!
[10:08:31] * jayme rebalanced pods in staging-codfw
[10:12:16] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: Many repeated config file changed / config file reloaded messages from promehteus statsd exporter - https://phabricator.wikimedia.org/T300629 (10JMeybohm)
[10:23:42] 10serviceops, 10Add-Link, 10Growth-Team, 10Patch-For-Review: Many repeated config file changed / config file reloaded messages from promehteus statsd exporter - https://phabricator.wikimedia.org/T300629 (10JMeybohm) 05In progress→03Resolved I've applied the default change and tested with linkrecommenda...
[10:36:07] jayme: if you are ok I am going to proceed with kubestage1004
[10:36:34] elukey: no objections
[10:40:59] mmm interesting, pulling the istiod image is taking ages
[10:41:29] on 1003 I mean, after draining 1004
[10:41:45] it took 4 minutes
[10:41:49] and now it finished
[10:42:14] jayme: --^
[10:42:17] normal on staging noes?
[10:42:21] *nodes
[10:42:29] uhm... no
[10:42:55] that's pretty long. The eqiad nodes are very new as well
[10:43:24] it is around 150MB, weird
[10:43:56] Normal  Pulling  4m4s  kubelet, kubestage1003.eqiad.wmnet  Pulling image "docker-registry.discovery.wmnet/istio/pilot:1.9.5-5"
[10:45:48] does not seem like there is a problem in the dragonfly p2p network... strange.
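For reference, the depool / drain / reimage / uncordon cycle the two are walking through above has roughly the following shape. This is a sketch only: the conftool invocation, the cookbook name, and the kubectl flags are assumptions based on common practice and are not confirmed anywhere in this log.

  # Drain the node so its pods reschedule onto the remaining staging nodes
  # (flags are an assumption; older kubectl used --delete-local-data)
  $ kubectl drain kubestage1003.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data

  # Depool the host from service (conftool invocation is an assumption)
  $ sudo confctl select name=kubestage1003.eqiad.wmnet set/pooled=no

  # Reimage to bullseye (cookbook name is an assumption)
  $ sudo cookbook sre.hosts.reimage --os bullseye kubestage1003

  # Once the host is back up, allow scheduling again and repool it
  $ kubectl uncordon kubestage1003.eqiad.wmnet
  $ sudo confctl select name=kubestage1003.eqiad.wmnet set/pooled=yes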
[10:47:21] 10serviceops, 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10MatthewVernon)
[10:49:57] elukey: I think the node might have been pulling a bunch of images in parallel (because of the drain and its pretty cold cache due to the recent reimage)
[10:52:16] jayme: makes sense yes, I think that we can proceed with 1004
[10:52:22] ack
[11:07:11] 10serviceops, 10Prod-Kubernetes: setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T302208 (10JMeybohm)
[11:07:32] 10serviceops, 10Prod-Kubernetes: setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T302208 (10JMeybohm)
[11:28:48] 1004 back from the reimage, uncordoned and pooled
[11:28:55] so all staging envs are on bullseye now!
[11:36:36] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Done by Feb 23🔥): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10MatthewVernon)
[11:37:46] going afk for lunch, lemme know if anything looks weird
[13:11:37] I'd like to further scale up jobqueue, any objections? number of replicas is getting a little high but unfortunately might be necessary for the short-term https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/762418
[13:15:16] 10serviceops, 10DC-Ops, 10SRE: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (10akosiaris)
[13:43:52] 10serviceops, 10Parsoid: Move testreduce to nodejs 12 - https://phabricator.wikimedia.org/T301303 (10MatthewVernon)
[13:44:38] 10serviceops, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10MatthewVernon)
[13:45:36] hnowlan: +1 - still looks fine
[14:02:14] 10serviceops, 10WMDE-Technical-Wishes-Maintenance: Migrate kartotherian production service to node12 - https://phabricator.wikimedia.org/T301475 (10MatthewVernon)
[14:02:31] 10serviceops, 10WMDE-Technical-Wishes-Maintenance: Migrate geoshapes production service to node12 - https://phabricator.wikimedia.org/T301476 (10MatthewVernon)
[15:05:01] 10serviceops, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Joe) Ok so a few requirements: 1) we need the repository to be on gerrit, and to include a `.pipeline` directory t...
[15:06:29] 10serviceops, 10SRE: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Joe) a:03Joe
[15:13:19] 10serviceops, 10SRE: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Joe) I think this is the old etcd certificate we used to use for etcd in codfw; since we've moved to etcd v3 we're using a new cert created with cergen: ` $ openssl s_client -host conf2004.codfw.wmnet -p...
[16:44:29] 10serviceops, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Michaelcochez) @Joe for the base image, would you recommend our current approach of starting from an 'empty' image...
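The certificate inspection Joe quotes in T302153 (truncated above) can be reproduced with standard openssl. A sketch, assuming the etcd default client port 2379 and that conf2004 actually serves the certificate in question on it:

  # Fetch the certificate the server presents and print its subject,
  # issuer, and validity window (notBefore/notAfter)
  $ echo | openssl s_client -connect conf2004.codfw.wmnet:2379 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates

The notAfter date is what tells you whether the cert flagged by the renewal alert is the one actually being served, or a stale pre-cergen one as Joe suspects.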
[16:48:56] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T293728 (10elukey) @akosiaris both staging clusters are on bullseye with overlay, I have updated the hiera settings after some rounds of reimage. I am currently reimaging all ml-serve nodes wi...
[16:59:14] 10serviceops, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Joe) >>! In T301471#7726097, @Michaelcochez wrote: > @Joe for the base image, would you recommend our current appro...
[17:02:38] 10serviceops, 10Prod-Kubernetes: setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T302208 (10JMeybohm)
[17:02:43] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10JMeybohm)
[17:04:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T293728 (10JMeybohm) >>! In T293728#7726108, @elukey wrote: > Then once the host is up and running, uncordon/pool/etc.. For new nodes it is easier, maybe we could try to add one with bullseye...
[17:10:11] 10serviceops, 10Product-Infrastructure-Team-Backlog, 10SRE, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10akosiaris) >>! In T274388#7722815, @MSantos wrote: > @akosiaris and @jijiki how can we move forward with this? > > For context: > - [[...
[18:16:55] ml-serve-codfw on bullseye + overlay! (8 worker nodes in total)
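A quick way to verify claims like "all staging envs are on bullseye" or "ml-serve-codfw on bullseye + overlay" after a round of reimages is to ask the kubelets themselves; this uses only standard kubectl, reading the OS image each node reports in its status:

  # OS-IMAGE and KERNEL-VERSION columns show what each kubelet is running on
  $ kubectl get nodes -o wide

  # Or just node name and OS image, one per line, via jsonpath
  $ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}'

Any node still showing a buster OS image after the reimage round would stand out immediately.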