[05:39:57] serviceops, Data-Engineering, SRE, Traffic, Trust-and-Safety: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (odimitrijevic)
[07:24:48] good morning folks
[07:24:52] today's menu:
[07:24:57] kubernetes2017 https://gerrit.wikimedia.org/r/c/operations/puppet/+/770439/1
[07:25:08] kubernetes1007 https://gerrit.wikimedia.org/r/c/operations/puppet/+/770440/1
[07:25:22] (first eqiad node --^)
[07:29:36] <_joe_> elukey: full steam ahead
[07:29:55] <_joe_> elukey: but also, weren't we supposed to let the manager pretend he's an engineer?
[07:32:21] _joe_ the manager is adding new kubernetes nodes today, sssshhhh
[07:32:30] * elukey runs away
[07:32:32] :D
[07:33:59] ah I just realized that the recipes may be a little more correct, let me amend them
[07:35:02] no nevermind, all good :)
[07:36:34] (need to step afk for ~30 mins, will be back later and start the reimages)
[08:13:55] o/ I think we should maybe wait for the manager to add at least one of the new nodes so that we end up with less re-scheduling
[08:20:30] <_joe_> do we have a "blocked on management" tag on phabricator?
[08:21:02] eheh
[08:35:02] ack will do 2017 then!
[08:48:37] serviceops, SRE, User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (MoritzMuehlenhoff) Open→Declined This doesn't seem relevant any more, I'll boldly go ahead and close it. We originally used it for HHVM and these days we can easily insta...
[09:06:28] elukey: looking at the kask nodes I think we could drop vdb completely and use one 20GB disk for LVM (for / and /var/lib/docker) or maybe even get away with just the 10GB vda (as those nodes only run a very limited set of containers)
[09:09:05] if that makes things easier...if not, we could also just use flat.cfg (minus swap) for vda and mount vdb as /var/lib/docker (or /var/lib if we want to catch /var/lib/kubelet as well)
[09:12:09] jayme: the cleanest option in my opinion is to drop vdb and expand vda to 20G, install bullseye with a regular partition scheme without swap
[09:12:46] elukey: that means everything is in /, right?
[09:12:56] jayme: yes yes
[09:13:16] absolutely fine with that for those particular nodes!
[09:13:17] for ganeti I don't see a great advantage in having multiple partitions, but we can do anythings
[09:13:20] *anything
[09:13:32] agreed
[09:21:09] serviceops, Prod-Kubernetes, Wikidata, Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (dcausse)
[09:35:07] serviceops, Prod-Kubernetes, Wikidata, Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (dcausse)
[09:42:17] kubernetes2017 done!
[09:42:53] serviceops, Prod-Kubernetes, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (elukey)
[09:45:04] jayme: in theory if we go for the single /dev/vda we could keep the current partman config
[09:45:26] since vdb is unmanaged atm for device mapper no?
[09:45:53] aiui, yes
[09:48:00] perfect, I am going to write a simple reimage plan for the first codfw kask node, and then I'll also ask Moritz for a quick sanity check
[09:48:13] anything peculiar to do for the kask nodes besides drain?
[09:48:41] fewer changes sounds preferable. LVM is still a bit of an overkill there, but I guess that overhead is negligible
[09:49:10] elukey: No. From maintenance perspective they can be treated like normal nodes if you go 1 by 1
[09:50:09] * elukey sees himself with the tshirt "I broke wikipedia"
[09:59:57] serviceops, Prod-Kubernetes, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (elukey) There are some ganeti VMs running as kubernetes nodes in both clusters, with two vir...
[10:00:00] jayme: --^
[10:00:42] in theory it should work
[10:01:24] sounds good to me
[10:03:30] and the companion cr is https://gerrit.wikimedia.org/r/c/operations/puppet/+/770459
[10:07:29] +1ed
[10:15:19] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Upgrade kubernetes clusters to a security supported (LTS) version - https://phabricator.wikimedia.org/T244335 (JMeybohm)
[10:19:07] jayme: Got the +1 from Moritz too, ok to drain and start with 2005?
[10:19:31] elukey: yeah, go ahead!
[10:19:39] ack!
[10:47:57] akosiaris: from looking at the calendar it seems we should put the "Core" meeting part in PST/PDT TZ so it switches to US daylight savings time together with the other one :)
[10:56:15] the recipe for vms seems to work, the only nit is that I had to confirm something during d-i (namely to use all the 20g available since the recipe seems to suggest to use only 10 of course)
[10:56:24] I am running puppet on 2005 now
[11:03:53] jayme: kubernetes2005 ready for a check :)
[11:04:29] cool.
will do in a minute
[11:10:00] elukey: looks good from my POV
[11:13:32] ah one thing that I noticed, the interface name was updated but netbox isn't (of course)
[11:22:48] ok fixed!
[11:23:27] aand uncordoned
[11:23:51] Cc: hnowlan: o/ as FYI we just drained + reimaged + uncordoned kubernetes2005 (one of the kask nodes)
[11:24:19] (IIRC it falls under your radar but if not throw everything in /dev/null :)
[11:27:01] _joe_, jayme: cal DST issue fixed. damn I hate DST.
[11:31:18] <_joe_> ^^
[11:31:31] <_joe_> there's only one thing I despise more
[11:31:33] <_joe_> leap seconds
[11:38:14] elukey: ack!
[12:27:26] Apologies for sounding like a broken record, but I'm still hoping to progress with this as soon as it's practicable: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764375
[12:28:28] There's one outstanding question about how I should address a service, but it's related to the TLS configuration that we haven't looked at yet.
[12:29:04] If there's anything I can do to help, please do let me know. Thanks.
[13:51:05] serviceops, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ayounsi) I came across 3 planned parse servers in rack C8, https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=24&role=server As a reminder, C8 and D5 are dedica...
[14:27:17] serviceops: Stop loading wddx PHP extension - https://phabricator.wikimedia.org/T295725 (JMeybohm) Open→Resolved From 7.4 onward wddx is no longer loaded.
[14:27:21] serviceops, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (JMeybohm)
[14:29:48] serviceops: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (JMeybohm) Manual tests (with and without cookie `PHP_ENGINE=7.4`, as well as invalid values) against appserver and parsoid seem fine to me so far
[14:40:42] serviceops, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (akosiaris) Replying instead of Daniel, he is currently unavailable. @Cmjohnson, I guess rows E & F are ok, I think it will be the first stuff we will be operating...
[14:45:41] serviceops, Prod-Kubernetes, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (elukey) All good, kubernetes2005 reimaged as planned. The only little issue that I encounte...
[15:54:14] rzl: _joe_: akosiaris: https://www.dict.cc/?s=sowas
[15:54:42] the abbreviation is pretty good actually :)
[15:54:49] ah, lol. Just now I got it. Not bad at all!
[15:58:39] ahaha perfect
[16:10:59] serviceops, Prod-Kubernetes: Keep track of teams responsible for namespaces inside kubernetes - https://phabricator.wikimedia.org/T303744 (JMeybohm) p:Triage→Low
[16:15:34] if someone could give https://gerrit.wikimedia.org/r/c/operations/dns/+/770529 a quick review (simple three-line DNS SRV record addition), I would be eternally grateful :)
[16:17:58] serviceops, SRE, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
- https://phabricator.wikimedia.org/T245183 (Krinkle)
[16:18:49] klausman: that's a long time - took the deal :)
[16:19:24] Noted :) Here's your beverage-of-choice voucher :)
[16:42:35] btullis: I already started the second round. Will have something by tomorrow
[16:48:22] serviceops, Release Pipeline, SRE, Goal, Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (akosiaris)
[16:49:13] serviceops, Release Pipeline, SRE, Goal, Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (akosiaris) Open→Resolved a:akosiaris Resolving. Wikifeeds has been migrated, restrouter migration was cancelled, the process is d...
[17:39:50] jayme: Many thanks and sorry to be a pain.
[20:21:11] serviceops, SRE, envoy: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (RLazarus)
[20:27:58] serviceops, Beta-Cluster-Infrastructure, SRE, envoy: Clean up Puppet support for Envoy v2 config API - https://phabricator.wikimedia.org/T303770 (RLazarus)