[09:15:22] hi there, MW train and backports are currently blocked by this issue https://phabricator.wikimedia.org/T379044
[09:15:49] anyone around right now who could help with it?
[09:16:32] <_joe_> jnuche: uh not sure what's going on
[09:18:50] 'mwscript eval.php --wiki testwiki' started failing last night with that network failure, but I can't see any obvious reason why that started happening
[09:21:07] <_joe_> jnuche: can you try again?
[09:21:20] <_joe_> it's pretty clear from the stack trace the problem was rsyslog
[09:21:33] will do
[09:22:40] ok, running now
[09:23:51] _joe_: same failure :(
[09:24:26] <_joe_> ok no idea what the problem actually is given mwscript eval.php --wiki testwiki works correctly from the CLI
[09:24:38] <_joe_> jnuche: can you point me to the code in scap where this gets evaluated?
[09:25:52] the call is implemented here https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/mwscript.py?ref_type=heads, the actual command is at the bottom https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/mwscript.py?ref_type=heads#L287
[09:27:31] <_joe_> what I mean is that eval.php opens a shell in theory
[09:27:48] <_joe_> so I guess we're piping php instructions to the subcommand
[09:28:45] <_joe_> oh damn it's run within docker
[09:29:57] yeah, it runs in docker with `--network none`
[09:30:07] <_joe_> uhm
[09:30:15] I'm not familiar with the code, but AFAICT the call is literally `mwscript eval.php --wiki testwiki`
[09:30:19] <_joe_> maybe that's the issue?
[09:30:52] yeah, that is my guess right now, the thing is the dockerized version has been running for at least a week (maybe longer) without issues
[09:31:06] so it didn't seem to need the network access before
[09:31:53] <_joe_> yeah
[09:32:06] <_joe_> but now something is trying to log
[09:32:11] <_joe_> and socket_sendto fails
[09:32:14] <_joe_> even if it's udp
[09:33:06] <_joe_> so I guess the problem isn't at the system level
[09:33:17] <_joe_> I'll try to repro
[09:33:52] thx
[09:34:48] we can always change it to `--network host` but since I don't know why access wasn't given before, that would concern me a bit
[09:40:54] <_joe_> yeah I'd like someone to look on the mw side if anything changed in how we run scripts tbh
[09:42:28] <_joe_> jnuche: what's self.user in that context? www-data?
[09:43:52] checking
[09:44:43] yeah, it's `www-data`
[09:45:04] who would be a good person to ping? JamesF maybe?
[09:45:44] <_joe_> jnuche: no idea but I'm 100% not sure how that script can even run without network
[09:45:58] <_joe_> it means for instance that we don't have etcd access
[09:46:06] <_joe_> without which you can't configure mediawiki
[09:47:38] ok, I'll try to follow up with someone from the MW side
[09:47:42] ty _joe_
[09:48:32] <_joe_> jnuche: maybe it's not 100% clear to me what is being done there, as in what's the docker image used there
[09:50:21] _joe_: IIRC it was done so we can build MW images on the releases server: https://releases-jenkins.wikimedia.org/job/MediaWiki%20publish%20WMF%20single-version%20image/
[09:52:18] the image is docker-registry.wikimedia.org/php7.4-fpm-multiversion-base
[09:52:34] <_joe_> uhm ok
[09:53:40] it's part of the group-1 hypothesis, Dan and Bryan had been tinkering around that and wanted to be able to build MW images outside of the deployment server
[09:53:58] not super sure about all the details though
[09:54:41] <_joe_> so, when did the issue start?
[09:55:01] <_joe_> because frankly it looks like a scap/mw interaction problem
[09:56:20] the problem started last night, the containerized version has been longer at least for all of last week AFAICS
[09:56:30] *has been deployed at least
[09:58:03] <_joe_> yeah but when was the last successful scap?
[09:59:06] I can see one successful backport yesterday: https://sal.toolforge.org/log/6iIf-ZIBFFSCpsJz1WyO
[10:05:05] <_joe_> jnuche: so as I wrote on the task, for now running docker with network=host is ok and safe
[10:05:12] <_joe_> to unblock deployments
[10:05:54] <_joe_> you all can investigate the issue later, with people who would know how to repro that specific point in code - I guess they have some special config they use in order to not need the network
[10:06:47] sounds good, I'll put together a Scap MR
[10:53:14] I want to try to rerun the train presync, but I'd be spilling into the SRE infra window
[10:53:29] is it ok if I go ahead? was there anything planned for the window?
[10:54:07] I don't have anything specific
[10:54:39] Unblocking the train is a priority anyways, so I'd say go ahead
[10:55:18] ack, ty
[16:00:58] I'm looking to merge in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075608 which will drop RSA cert support from various services in modules/profile/templates/idp/client/ (icinga, karma, klaxon, librenms, orchestrator). Any voiced concerns with that would be appreciated :)
[21:35:08] nothing from on-call. it was quiet. which I am glad about on this day. now also really going afk.
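
A minimal sketch of the workaround discussed at [10:05:05], i.e. running the same container with `--network host` instead of `--network none`. This is not the code in scap's mwscript.py: the function name `build_docker_command` and its parameters are hypothetical, and it assumes `mwscript` can be invoked directly as the container command. The image name, the `www-data` user and the `mwscript eval.php --wiki testwiki` call are taken from the log above; `--rm`, `--network` and `--user` are standard `docker run` options.

```python
import shlex
from typing import List

# Built-in Docker network modes accepted by this sketch (an illustrative
# restriction; Docker itself also accepts user-defined network names).
ALLOWED_NETWORKS = {"none", "host", "bridge"}


def build_docker_command(
    image: str, user: str, network: str, script_args: List[str]
) -> List[str]:
    """Return the argv for a `docker run` call with the given network mode.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if network not in ALLOWED_NETWORKS:
        raise ValueError(f"unsupported network mode: {network!r}")
    return [
        "docker", "run", "--rm",
        "--network", network,  # "none" leaves the container with loopback only
        "--user", user,
        image,
        *script_args,
    ]


if __name__ == "__main__":
    image = "docker-registry.wikimedia.org/php7.4-fpm-multiversion-base"
    script = ["mwscript", "eval.php", "--wiki", "testwiki"]
    # Failing setup from the log vs. the temporary unblock.
    for mode in ("none", "host"):
        print(shlex.join(build_docker_command(image, "www-data", mode, script)))
```

Keeping the network mode as a single parameter makes it easy to switch back to `none` once the MW-side cause of the logging attempt is understood.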