[08:32:45] serviceops, LandingCheck, MW-on-K8s, MediaWiki-extensions-WikimediaEvents, and 4 others: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. - https://phabricator.wikimedia.org/T352156 (Joe) >>! In T352156#9365189, @hashar wrote: >>>! In T3521...
[08:35:15] serviceops, LandingCheck, MW-on-K8s, MediaWiki-extensions-WikimediaEvents, and 4 others: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. - https://phabricator.wikimedia.org/T352156 (Joe) Resolved→Open The task is not resolved until...
[08:43:55] serviceops: Migrate etcd::tlsproxy Nginx certs to PKI - https://phabricator.wikimedia.org/T352245 (MoritzMuehlenhoff)
[09:09:49] serviceops, LandingCheck, MW-on-K8s, MediaWiki-extensions-WikimediaEvents, and 4 others: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. - https://phabricator.wikimedia.org/T352156 (Joe) Open→Resolved I created the subtasks, assign...
[10:00:53] serviceops: Migrate etcd::tlsproxy Nginx certs to PKI - https://phabricator.wikimedia.org/T352245 (Joe) When we make the change, it will require a restart of etcd on the nodes. We will need to perform the change and then issue a restart of all pybals connected to the specific server, and when we're done, al...
[10:24:17] _joe_: I have a series of pending patches for the deployment server `fix-staging-perms` script. It has been floating around for a few months; maybe I can pair on it with someone else? (the series starts at https://gerrit.wikimedia.org/r/c/operations/puppet/+/927674/ )
[10:25:47] <_joe_> hashar: right now I don't have time, let's see if someone else in the team does
[10:28:11] cool :)
[10:28:22] and https://phabricator.wikimedia.org/T338205 has the rationale for each of the 3 puppet changes
[10:40:41] hashar: I left a drive-by comment on one of the patches
[10:41:33] I'm having a look, same comment as taavi
[11:25:47] serviceops, WMF-JobQueue: Make changeprop-jobqueue error handling/httpbb tests better behaved - https://phabricator.wikimedia.org/T352265 (hnowlan)
[11:26:31] serviceops, WMF-JobQueue: Make changeprop-jobqueue error handling/httpbb tests better behaved - https://phabricator.wikimedia.org/T352265 (hnowlan) p:Triage→Low
[13:12:34] taavi: claime: I guess I screwed up a merge conflict resolution earlier :]
[13:12:40] I fixed it and rebased the series
[13:23:59] hashar: ok lgtm now, we'll probably have to revamp that entire system once mw-on-k8s is completed
[13:24:06] It's getting a bit... layered
[13:24:41] hehe
[13:24:44] ah you need me for +2 as well
[13:25:12] I think I did that series as a follow-up to some incident and wanted to ensure whatever the issue was would not reproduce
[13:25:26] so ideally merge a change, run puppet on the deployment server to check what is going on
[13:25:32] move to next patch, run puppet etc
[13:25:42] right
[13:25:48] here we go then
[13:26:24] given that fix-staging-perms is merely for non-root users to be able to recover access to files in /srv/mediawiki-staging
[13:26:41] in case someone ran a command as root or the group writable bit is missing
[13:26:53] so that should not affect anything :)
[13:31:01] $ cat /usr/local/etc/fix-staging-perms.sh
[13:31:01] deployment_group="deployment"
[13:31:04] that is encouraging :)
[13:32:21] lots of files have the wikidev group
[13:32:36] 430 to be precise
[13:32:59] my understanding is all of them should now be owned by `deployment`
[13:33:31] at least the /srv/patches for sure
[13:33:51] for /srv/mediawiki-staging I guess they did not get migrated or got moved back to `wikidev` after someone ran the fix staging perms script
[13:33:56] Yeah there's only 2 in patches
[13:34:17] ok I'll run the script before moving on to the next patches yeah?
[13:34:32] sounds good
[13:34:55] and we non-root users are in both the `deployment` and the legacy `wikidev` groups
[13:35:16] so I think it is fine to update the group for the `/srv/mediawiki-staging` files
[13:35:31] and I am running the train this week, so if something breaks I can self-blame :D
[13:35:58] (and me running the train is why I remembered about that series)
[13:36:08] some symlinks are owned by wikidev as well, that's the only thing left after running the script
[13:36:33] hmm
[13:36:52] I guess because `chgrp` changes the group of the target rather than the link itself
[13:37:29] yeah, and the link doesn't change group
[13:37:31] yeah `chgrp --no-dereference`
[13:37:36] it's no big deal though
[13:37:53] when the default is to dereference it
[13:41:20] claime: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978541
[13:41:45] I imagine one could add a symlink `foobar` pointing to some secret and that would make the secret owned by the `deployment` group
[13:42:30] which looks like it has always been an issue
[13:42:43] * hashar hates computers
[13:48:14] The permissions on the symlink itself are largely immaterial, it's usually 777 anyways (for the symlink itself), and will be transparent to most things checking ownership
[13:48:21] But might as well make it clean
[13:54:39] hashar: scap3 part of the patch merged and no-ops on deploy2002
[13:54:57] Do you want to check something before I move on to the last patch?
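The symlink issue discussed above (13:36 to 13:48) can be illustrated with a small shell sketch. This is a hypothetical helper, not the contents of the real `fix-staging-perms.sh` (the log only shows its `deployment_group` line), and the function name `fix_group` is made up for illustration. The point it shows: `chgrp` on a symlink path changes the link's target unless `-h` (GNU long form `--no-dereference`) is given, and `chmod` on a symlink path always changes the target, so here it is limited to regular files and directories.

```shell
# Hypothetical sketch, not the actual fix-staging-perms script.
# fix_group DIR GROUP: give every entry under DIR the group GROUP and make
# regular files and directories group-writable, without following symlinks.
fix_group() {
    # find's default -P mode inspects symlinks themselves (lstat), so
    # "-group" checks the link's own group, not its target's.
    # chgrp -h changes the link itself; without -h, chgrp would change
    # whatever the link points to, possibly a file outside DIR.
    find "$1" ! -group "$2" -exec chgrp -h "$2" {} +
    # chmod follows symlinks given on its command line, so only pass it
    # regular files and directories (symlinks are "-type l" and skipped).
    find "$1" \( -type f -o -type d \) ! -perm -020 -exec chmod g+w {} +
}
```

Usage would be along the lines of `fix_group /srv/mediawiki-staging deployment`; the group may also be given as a numeric GID, since both `find -group` and `chgrp` accept one.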
[13:56:04] the deployment_group is set via hiera
[13:56:07] so it is always set
[13:56:15] and if Puppet ran fine, that means it is indeed set
[13:56:30] All right, moving on
[13:56:50] that patch was merely to prevent Puppet from magically using the obsolete `wikidev` default and instead have it bail out because it is missing a value
[13:56:54] so yeah noop
[13:57:24] and the next one is marked as in conflict by Gerrit due to the `chgrp --no-dereference` patch I have sneaked in
[13:57:30] yep
[13:57:34] let me rebase it
[13:57:40] ack
[13:58:00] honestly, I am not sure why I went to craft 3 (well 4) different patches :)
[14:00:32] so yeah trivial rebase
[14:00:49] that last https://gerrit.wikimedia.org/r/c/operations/puppet/+/927676/ is merely to set the set-group-ID bit on the directories
[14:02:26] and even though /srv/patches has it, the `.git` directory beneath it does not have it
[14:02:36] cause ... I don't know :)
[14:03:49] and the git files are owned by uid 2246 which is Chris Steipp and he left the foundation a while ago
[14:04:12] so yeah I guess all of that is legacy / has been "broken" for ages
[14:09:51] hashar: all done
[14:10:15] you even ran the script to set the g+s bits on /srv/patches/.git \o/
[14:14:51] claime: all set, thank you so much :)
[14:46:03] serviceops, LandingCheck, MW-on-K8s, MediaWiki-extensions-WikimediaEvents, and 4 others: PHP Warning: geoip_country_code_by_name(): Required database not available at /usr/share/GeoIP/GeoIP.dat. - https://phabricator.wikimedia.org/T352156 (Jdforrester-WMF)
[14:48:30] serviceops, SRE: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (Papaul)
[14:55:34] serviceops, SRE: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (JMeybohm) Open→Resolved a:JMeybohm This LGTM now ` /dev/md0: Version : 1.2 Creation Time : Thu Sep 21 12:32:55 2023 Raid Level : raid1 Array Size : 937267200 (...
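The set-group-ID change for directories (patch 927676 above) can be sketched the same way; `set_dir_setgid` is an illustrative name and this is not the real patch. On Linux, a file created in a directory carrying the g+s bit gets the directory's group instead of the creator's primary group, and new subdirectories inherit the bit. Running `chmod g+s` on a parent does not touch subdirectories that already exist, so an existing one such as `/srv/patches/.git` needs its own pass, which is what the recursive `find` below does.

```shell
# Hypothetical sketch, not the contents of the actual Puppet patch.
# set_dir_setgid DIR: add the set-group-ID bit to DIR and every directory
# beneath it (hidden ones such as .git included) that lacks it.
set_dir_setgid() {
    # -perm -2000 matches directories that already have the setgid bit;
    # the negation selects only those still missing it.
    find "$1" -type d ! -perm -2000 -exec chmod g+s {} +
}
```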
[15:12:36] hi serviceops team, a naive question for you: for how much longer do we expect to have bare-metal appservers? :)
[15:49:54] cdanis: good scenario, bad scenario, the in-between?
[15:50:00] what's your pick?
[15:50:35] akosiaris: give me a range, I like ranges
[15:51:18] we could be free of bare-metal appservers anywhere between 2024-07-01 and 2024-12-31
[15:51:31] that helps!
[15:51:33] thank you <3
[15:51:49] you are welcome. There are some gotchas that might very well alter the above ofc
[15:52:31] e.g. how well the dumps generation from events will end up going, how we will manage to solve the videoscaler problem
[15:52:35] I assume the gotchas would make it later, not sooner
[15:52:39] yes
[15:52:41] 👍
[16:51:52] serviceops, Machine-Learning-Team: Multiple images fail to build from sources - https://phabricator.wikimedia.org/T350366 (elukey)
[16:52:58] serviceops: Multiple images fail to build from sources - https://phabricator.wikimedia.org/T350366 (elukey)
[16:53:37] serviceops: Multiple images fail to build from sources - https://phabricator.wikimedia.org/T350366 (elukey) @Ottomata I see the following error when flink-kubernetes-operator is built: ` 2023-11-29 16:38:45,186 [docker-pkg-build] INFO - --_curl_--https://dlcdn.apache.org/flink/flink-kubernetes-operator-1.4....
[19:43:15] hashar: thanks for the ci deployments <3
[19:47:18] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (cmooney) @VReilly-WMF just a heads up, for kubernetes1059 I think you selected ssw1-e1-eqiad (this is the spine switch with QSFP ports), rather than lsw1-e1-eqiad (this is the LE...
[21:00:32] isaranto: you are welcome :)