[08:03:09] (03PS8) 10Jakob: [DNM] Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) [08:03:57] ^ hashar: bonjour! any chance you could take another look at that? :) [08:04:17] jakob_WMDE: unlikely this week unfortunately. I will see what I can do :) [08:04:35] ok, thank you! [08:04:39] I am running the MediaWiki train this week and thursday is an holiday here [08:04:57] but if it is calm enough hopefully I will have the bandwith on Friday! [08:05:01] (03CR) 10CI reject: [V:04-1] [DNM] Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [08:05:09] or maybe it is straightforward and I would just do it we shall see [08:06:25] jakob_WMDE: have you managed to build the image locally and test it? [08:06:31] s/test/try/ it? [08:06:55] yes :) [08:06:59] ahh good [08:07:19] which is like 80% of the work [08:07:46] so I guess I can manage to multitask the remaining 20% once you get CI Verifiying +1 and that you are happy with [08:08:01] if that does not affect the rest and is behind a feature flag, I guess it is an easy review [08:08:20] the rebuild is automatzed (I'll just run ./fab deploy_docker) from the root of the repo [08:08:29] so yeah keep pinging me :] [08:09:37] ok, that sounds promising, thanks! [09:05:15] (03PS9) 10Jakob: Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) [09:18:07] jakob_WMDE: I did the review of the Quibble patch https://gerrit.wikimedia.org/r/c/integration/quibble/+/1137857 [09:18:08] :) [09:18:17] tldr, drop distutils :) [09:18:19] rest is fine [09:18:33] oh there might be a need to use maintenance/run.php from MediaWiki core [09:18:52] but I don't know whether it can find the maintenance script from an extension. I haven't checked :/ [09:24:37] thanks for the review! 
I think I tried using run.php for the CirrusSearch maintenance scripts and it didn't work out of the box, but I can take another look [09:28:07] jakob_WMDE: Quibble guarantees the extensions are cloned under $IP/extensions/* [09:28:24] but it could theorically be invoked outside of the path and end up messing things up [09:28:34] whereas maintenance/run.php would resolve the paths for us [09:28:35] but then [09:28:50] don't waste too much time on it [09:29:04] if it does not work and there is no quick/easy fix, just keep and we will use it as-is [09:29:08] distutils should be gone though [09:29:36] and in another change maybe we can roll our own copy of strtobool, but I think it is sufficient to just check for the env variable existence in order to enable the feature [10:54:23] (03PS9) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [10:58:58] hashar: I removed the use of distutils, but failed to get getMaintenanceScript() to find the CirrusSearch maintenance script :( [11:55:13] (03CR) 10Jakob: Add OpenSearch (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [12:25:23] (03CR) 10Hashar: Add OpenSearch (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [12:25:49] jakob_WMDE: so at least `php maintenance/run.php 'CirrusSearch\Maintenance\UpdateSearchIndexConfig'` works for me [12:26:16] I gotta debug getMaintenanceScript now :b [12:27:37] hehe :) [12:28:15] I'll change it to using run.php without getMaintenanceScript for now [12:36:10] 10Continuous-Integration-Infrastructure, 10Testing Support, 10ci-test-error (WMF-deployed Build Failure), 10MW-1.44-notes (1.44.0-wmf.23; 2025-04-01), 13Patch-For-Review: Selenium timeouts can cause the job to remain stuck until the build times out - https://phabricator.wikimedia.org/T389536#10776250 (10z... [12:36:30] (03PS10) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [12:39:23] (03CR) 10Jakob: Add OpenSearch (032 comments) [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [12:59:14] >>> subprocess.call(quibble.mediawiki.maintenance.getMaintenanceScript('CirrusSearch:UpdateSearchIndexConfig')) [12:59:15] Updating cluster ... [12:59:15] indexing namespaces... [12:59:16] hmm [12:59:34] jakob_WMDE: maybe because you have MW_INSTALL_PATH set to something else [13:11:27] Could not open input file: maintenance/\CirrusSearch\Maintenance\UpdateSearchIndexConfig.php [13:11:30] that is what I got [13:11:38] I am gonna fix that getMaintenanceScript() [13:12:12] hashar: yeah, that's what I've been getting too [13:12:16] cool [13:12:42] so yeah that getMaintenanceScript does not support a class name [13:14:05] hmm? but isn't it only telling us "Could not open input file: maintenance/\CirrusSearch\Maintenance\UpdateSearchIndexConfig.php" because it thinks there is no run.php and then tries to open it like a php file? 
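A minimal sketch of the run.php-based invocation discussed around 12:25–12:28, assuming a hypothetical helper rather than the actual Quibble patch: `maintenance/run.php` resolves fully-qualified class names itself (including for extensions cloned under `$IP/extensions/`), so the caller does not have to guess a `maintenance/<name>.php` path.

```python
# Illustrative sketch only (not the Quibble change): run a class-named
# MediaWiki maintenance script through maintenance/run.php.
import subprocess

def run_maintenance_class(mw_install_path, class_name, *args):
    # run.php resolves names such as
    # CirrusSearch\Maintenance\UpdateSearchIndexConfig on its own.
    cmd = ['php', 'maintenance/run.php', class_name, *args]
    return subprocess.check_call(cmd, cwd=mw_install_path)

# Example (path is illustrative):
# run_maintenance_class('/workspace/src',
#                       r'CirrusSearch\Maintenance\UpdateSearchIndexConfig')
```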
[13:14:18] I think it would support the class name if it had found run.php [13:14:33] yup it should ideally :b [13:14:39] else: [13:14:39] if ext == '': [13:14:39] cmd = ['php', 'maintenance/%s.php' % basename] [13:15:02] so when it is given` UpdateSearchIndexConfig`, os.path.splitext() gives no extension [13:15:05] exactly, that's what ends up happening =) [13:15:05] the code enter that branch [13:15:22] and end up with a funky maintenance/UpdateSearchIndexConfig.php [13:15:28] so that function is broken in that regard [13:16:21] sorry for the misleading review! [13:16:54] then there is the who depends on who problem [13:16:55] no worries! thanks for the review! [13:17:05] and maybe we need elasticsearch to be added to the image first [13:17:22] so this way we can have your change https://gerrit.wikimedia.org/r/c/integration/quibble/+/1137857 tested with the image [13:17:25] but well [13:17:28] I am lazy this day [13:17:29] s [13:17:52] :D [13:18:20] for the docker image / supervisord config, did the trick `autostart = %(ENV_QUIBBLE_OPENSEARCH)s` work? [13:19:52] yes, that worked! [13:20:26] although I'm realizing that I didn't test that anymore since I changed the quibble code not to use strtobool D: [13:20:48] i.e. I don't know what `autostart = %(ENV_QUIBBLE_OPENSEARCH)s` does when QUIBBLE_OPENSEARCH is empty/unset [13:21:07] * jakob_WMDE tries [13:21:42] ah yeah [13:21:45] what a mess [13:21:45] :/ [13:21:54] I get why you went with strtobool now [13:21:58] * hashar face palms [13:22:03] palm fae [13:22:04] ce [13:22:07] whatever [13:25:33] "Error: not a valid boolean value: '' in section 'program:opensearch'" :( [13:28:26] :( [13:28:27] sorry [13:29:16] so my other review is wrong and we need to import strtobool from distutils [13:31:36] (03open) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [13:32:11] hashar: so the distutils dependency would be ok for now? [13:32:34] https://gist.github.com/hashar/8c08622dae4edfb8c07fb2c7d380f13f [13:32:35] :) [13:33:03] I apologize for the back and forth [13:33:14] let me add that one [13:33:18] or well now [13:33:21] it can be done in your change [13:33:28] you can stick that in quibble.utils [13:33:52] ok! no worries :) [13:33:54] then rollback to before my misleading comment [13:34:23] and maybe leave a comment in Quibble that CI uses QUIBBLE_OPENSEARCH to set autostart=false in supervisor [13:34:46] that would prevent me from refactoring to an empty string :b [13:36:04] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [13:37:03] next to where it's set to "false" in the dockerfile or where would be the best place for that comment? [13:38:22] well there is already a comment in supervisord.conf [13:38:26] that is probably sufficient [13:38:39] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [13:39:51] jakob_WMDE: I am polishing up the Quibble image and will build it [13:39:55] then switch the Jenkins job to them [13:40:40] ok, thanks! 
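For reference on the strtobool detour above: distutils (and with it `distutils.util.strtobool`) was removed from the standard library in Python 3.12, so the gist linked at 13:32 presumably carries a local copy. Below is a sketch of what such a helper in `quibble.utils` might look like (the actual gist may differ). On the image side, supervisord's `autostart = %(ENV_QUIBBLE_OPENSEARCH)s` needs a value it can parse as a boolean, which is why the variable has to default to "false" in the Dockerfile rather than being left empty or unset (the "not a valid boolean value: ''" error above).

```python
# Sketch of a local strtobool replacement (the linked gist may differ).
# Mirrors the semantics of the removed distutils.util.strtobool.
def strtobool(value):
    value = value.strip().lower()
    if value in ('y', 'yes', 't', 'true', 'on', '1'):
        return True
    if value in ('n', 'no', 'f', 'false', 'off', '0'):
        return False
    raise ValueError('not a valid boolean value: %r' % value)
```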
[13:41:52] (03CR) 10Hashar: Include OpenSearch in quibble (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [13:42:02] i never know whether I am picky [13:42:13] or have too many ideas surging and overflowing the people I review [13:42:15] or somewhere in between [13:42:24] or that some hamster in my head is spurting random ideas [13:42:25] :b [13:42:40] or it is because I should really stop multitasking [13:43:59] (03PS10) 10Hashar: Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [13:44:08] I tweaked the changelog files [13:44:34] it still using Quibble 1.13.0 [13:44:47] (03CR) 10Hashar: [C:03+2] Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [13:45:28] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [13:46:30] (03Merged) 10jenkins-bot: Include OpenSearch in quibble [integration/config] - 10https://gerrit.wikimedia.org/r/1137108 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [13:47:30] (03PS11) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [13:48:58] I am building the images [13:50:07] yay, thanks! [13:50:09] MOUAHAHAH [13:50:22] I am confused [13:50:25] and thanks for the speedy reviews! :D [13:50:29] oh no, what happened [13:52:40] (03CR) 10CI reject: [V:04-1] Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [13:52:45] so we do not use that quibble-bullseye image [13:53:00] but we use the docker-registry.wikimedia.org/releng/quibble-buster-php74 [13:53:03] which is based on Buster [13:53:06] and surely should no more be used [13:53:15] and maybe really we should drop php7.4 eventually [13:53:26] I mixed up buster/bullseye/bookworm [13:53:28] :/ [13:53:29] anyway [13:55:35] oh... sorry, I should've checked that, too :| [13:56:08] it is 100% CI fault [13:56:10] it is messy [14:03:02] (03PS1) 10Hashar: Add job to test Quibble with OpenSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1139866 [14:03:43] (03PS2) 10Hashar: Add job to test Quibble with OpenSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1139866 (https://phabricator.wikimedia.org/T386691) [14:04:12] (03PS1) 10Hashar: ci: add script to test OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1139867 [14:04:45] jakob_WMDE: I am adding a new CI job to integration/quibble. It uses the image you have prepared (I have finished building it) and invoke utils/ci-opensearch.sh [14:05:04] https://gerrit.wikimedia.org/r/c/integration/quibble/+/1139867/1/utils/ci-opensearch.sh [14:05:13] nice, thanks! 
[14:06:09] (03CR) 10Hashar: [C:03+2] Add job to test Quibble with OpenSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1139866 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [14:08:02] (03Merged) 10jenkins-bot: Add job to test Quibble with OpenSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1139866 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [14:10:55] (03PS12) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [14:21:11] ugh, "TimeoutError: Could not connect to port 9200 after 20 seconds" :/ [14:21:35] but I think the fact that it didn't exit with a bad status before that means that it's just slow? [14:23:19] (03PS1) 10Hashar: zuul: set QUIBBLE_OPENSEARCH for Quibble opensearch job [integration/config] - 10https://gerrit.wikimedia.org/r/1139870 (https://phabricator.wikimedia.org/T386691) [14:23:27] jakob_WMDE: ^:) [14:23:50] that is to make CI to set QUIBBLE_OPENSEARCH=true [14:23:54] on that fullrun opensearch job [14:24:01] that needs to happen when supervisord starts [14:24:50] that should be the correct one [14:24:55] I'm confused. it looks like it was already trying to start in https://integration.wikimedia.org/ci/job/integration-quibble-fullrun-opensearch-php74/2/console [14:25:06] oh [14:25:20] (03Abandoned) 10Hashar: zuul: set QUIBBLE_OPENSEARCH for Quibble opensearch job [integration/config] - 10https://gerrit.wikimedia.org/r/1139870 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [14:25:28] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [14:26:34] could it be listening on an other port? [14:26:41] or maybe it takes more than 20 seconds to start [14:26:42] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [14:27:11] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [14:27:16] pretty sure the port is correct [14:27:26] (03CR) 10CI reject: [V:04-1] Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [14:28:13] taking more than 20s could be. it took longer than 10s on my laptop [14:32:39] (03update) 10jnuche: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) [14:35:17] well you can try raising it [14:35:39] also my child change https://gerrit.wikimedia.org/r/c/integration/quibble/+/1139867 should be squashed into your change [14:39:23] (03PS13) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [14:43:28] (03PS14) 10Jakob: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) [14:50:02] jakob_WMDE: ah [14:50:15] the job uses buster-php74:1.13.0-s1 [14:51:12] oh, and we still only have it in the bullseye image? 
[14:51:15] that would explain it :) [14:51:34] I screwed it up [14:58:00] oh [14:58:05] I think I have found the issue [14:59:33] (03CR) 10CI reject: [V:04-1] Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [14:59:59] :O [15:00:02] what is it? [15:00:10] (03PS1) 10Hashar: jjb: fix Quibble fullrun always having buster-php74 image [integration/config] - 10https://gerrit.wikimedia.org/r/1139881 [15:00:58] 06Release-Engineering-Team, 10Scap: Strange scap error after check_testservers_k8s-1_of_2 after running sync-file - https://phabricator.wikimedia.org/T392910 (10sbassett) 03NEW [15:01:27] jakob_WMDE: the job template was hardcoded with buster-php74 [15:01:34] php81 got added later but did not remove the hardcoded value [15:01:42] I went to do the same and bam [15:01:47] ah :D [15:02:45] I have updated the job [15:02:59] (03CR) 10Hashar: "The job looks good now!" [integration/config] - 10https://gerrit.wikimedia.org/r/1139881 (owner: 10Hashar) [15:03:12] (03CR) 10Hashar: [C:03+2] jjb: fix Quibble fullrun always having buster-php74 image [integration/config] - 10https://gerrit.wikimedia.org/r/1139881 (owner: 10Hashar) [15:03:45] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/1139881" [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [15:03:56] I haven't done those kind of stuff for quite a while [15:03:58] I am rusty [15:03:58] (03open) 10dancy: log.py: @version should be "1" [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/779 [15:04:01] (03update) 10dancy: log.py: @version should be "1" [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/779 [15:04:41] jakob_WMDE: I am in a meeting then will check a bit the state of mediawiki train [15:04:47] so we can pursue tomorrow [15:04:47] (03Merged) 10jenkins-bot: jjb: fix Quibble fullrun always having buster-php74 image [integration/config] - 10https://gerrit.wikimedia.org/r/1139881 (owner: 10Hashar) [15:05:10] sounds good. I also have to sign off now [15:05:18] hashar: thanks for all the help! <3 [15:06:31] (03update) 10dancy: log.py: @version should be "1" [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/779 [15:10:55] 06Release-Engineering-Team, 10Scap: Strange scap error after check_testservers_k8s-1_of_2 after running sync-file - https://phabricator.wikimedia.org/T392910#10776917 (10dancy) →14Duplicate dup:03T380958 [15:10:59] 10Deployments, 10Release-Engineering-Team (Radar), 06serviceops, 07Wikimedia-production-error: httpb sometimes fails upon deployment with a HTTP 503 - https://phabricator.wikimedia.org/T380958#10776919 (10dancy) [15:17:45] 06Release-Engineering-Team, 10Scap, 10Dumps-Generation: scap needs to be k8s-cluster aware - https://phabricator.wikimedia.org/T388761#10776929 (10Scott_French) @brouberol - This would require changes to scap, specifically the ability to override the set of environments relevant to a particular deployment (r... 
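About the port-9200 timeout a few lines up: the wait is a plain TCP readiness poll, so the two knobs are how long to poll and which address to poke. A generic sketch is shown below, using a stand-in helper rather than Quibble's actual `_tcp_wait`; as the later review comment notes, probing `127.0.0.1` instead of `localhost` also matters, since `localhost` can resolve to `::1` first while the service may only be listening on the IPv4 loopback.

```python
# Generic TCP readiness poll (illustrative; Quibble has its own _tcp_wait).
import socket
import time

def wait_for_port(host='127.0.0.1', port=9200, timeout=60):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return
        except OSError:
            time.sleep(1)
    raise TimeoutError(
        'Could not connect to port %d after %d seconds' % (port, timeout))
```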
[15:20:07] (03Restored) 10Hashar: zuul: set QUIBBLE_OPENSEARCH for Quibble opensearch job [integration/config] - 10https://gerrit.wikimedia.org/r/1139870 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [15:21:56] (03CR) 10Hashar: [C:03+2] "QUIBBLE_OPENSEARCH needs to be set when starting the container since supervisord relies on it to start opensearch and it is the entry poin" [integration/config] - 10https://gerrit.wikimedia.org/r/1139870 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [15:23:25] (03Merged) 10jenkins-bot: zuul: set QUIBBLE_OPENSEARCH for Quibble opensearch job [integration/config] - 10https://gerrit.wikimedia.org/r/1139870 (https://phabricator.wikimedia.org/T386691) (owner: 10Hashar) [15:26:19] (03CR) 10Hashar: "recheck after having CI to set QUIBBLE_OPENSEARCH before starting the container/supervisord (I7e4ea39c77719eda1ff096ea94789aaa63271597)" [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [15:38:13] (03open) 10hnowlan: mw-cli:scripts: add case for mwscriptwikiset [repos/releng/release] - 10https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/168 (https://phabricator.wikimedia.org/T392441) [16:01:41] o/ could I get a review on https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/168 please? I don't have merge rights so a second authoritative set of eyes would be nice [16:03:13] (03merge) 10dancy: mw-cli:scripts: add case for mwscriptwikiset [repos/releng/release] - 10https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/168 (https://phabricator.wikimedia.org/T392441) (owner: 10hnowlan) [16:12:17] thanks dancy! [16:25:52] hashar: this is kind of for your wish https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137840/7 [16:31:11] (03PS15) 10Hashar: Add OpenSearch [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [16:53:43] mutante: reviewed! :) [16:53:53] I am off for dinner+night etc [16:58:04] oh, hah! good point that I was pirating the ASCII art:) [16:58:23] adding Apache license just became automatic without questioning it [16:58:37] (03CR) 10Hashar: "I have changed the _tcp_wait to poke `127.0.0.1` rather than `localhost` and that solved it. I am pretty sure I previously had the issue " [integration/quibble] - 10https://gerrit.wikimedia.org/r/1137857 (https://phabricator.wikimedia.org/T386691) (owner: 10Jakob) [16:59:00] but adding an "echo" and a shebang makes it a new work :p jk [17:06:03] Don't wanna get sued by a piece of software. [17:08:16] AI laywer [17:08:20] *lawyer [17:09:55] lol, yea. I also feel like we are the first ever to care about the license of the cowsay output but it's true. 
[17:10:20] also dont want to get into discussion with WMF-internal whether Artistic license is ok with us [17:23:12] Project mediawiki-core-doxygen build #10047: 04FAILURE in 4 min 47 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen/10047/ [17:48:04] 10Continuous-Integration-Infrastructure, 07Developer Productivity: Provide recheck option for failed jobs - https://phabricator.wikimedia.org/T392941 (10Jdlrobson-WMF) 03NEW [17:48:58] 10Continuous-Integration-Infrastructure, 07Developer Productivity: Provide recheck option for only failed jobs - https://phabricator.wikimedia.org/T392941#10777830 (10Jdlrobson-WMF) [17:55:05] I trust: urbanecm!.*@user/urbanecm (2admin), .*@user/urbanecmbackup/x-3733651 (2admin), .*@wikimedia/Martin-Urbanec (2admin), [17:55:05] @trusted [18:22:09] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945 (10bd808) 03NEW [18:23:40] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777944 (10bd808) 05Open→03In progress a:03bd808 https://meta.wikimedia.org/wiki/IRC/Instructions#Instructions_for_channel_ops `lang=irc /join #wikimedia-zuu... [18:26:37] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777950 (10bd808) https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/24 `lang=irc [16:57] < bd808> !isspull [16:57] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777951 (10bd808) https://meta.wikimedia.org/wiki/Wm-bot `lang=irc [17:54] < bd808> @add #wikimedia-zuul [17:54] < wm-bot> Attempting to join #wikimedia-zuu... [18:29:07] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777953 (10bd808) https://wmopbot.toolforge.org/help `lang=irc [18:04] < bd808> !join #wikimedia-zuul [18:04] < wmopbot> Joined ` [18:30:14] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777956 (10bd808) https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Joining_a_new_channel `lang=irc [18:11] < wm-bot> !log bd808@tools-bastion-12 tools.stashbot... [18:30:32] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for discussion and coordiation - https://phabricator.wikimedia.org/T392945#10777959 (10bd808) 05In progress→03Resolved https://meta.wikimedia.org/w/index.php?title=IRC/Channels&diff=prev&oldid=28635289 [19:25:29] bd808: wanna do something with the wm-bot in here? [19:25:38] I got pinged by the trusted listing [19:27:20] 10Scap (SpiderPig 🕸️), 06Infrastructure-Foundations: Add deployment group users to spiderpig-access ldap - https://phabricator.wikimedia.org/T392958 (10thcipriani) 03NEW [19:39:50] 06Release-Engineering-Team, 10Projects-Cleanup, 06translatewiki.net, 07Essential-Work: Archive the analytics/gobblin-wmf Gerrit repository - https://phabricator.wikimedia.org/T392854#10778237 (10amastilovic) Thank you @thcipriani ! 
[19:40:22] 10Continuous-Integration-Infrastructure (Zuul upgrade): Setup IRC channel for Zuul Upgrade discussion and coordination - https://phabricator.wikimedia.org/T392945#10778241 (10Aklapper) [19:42:49] Hey folks! I have a selenium job that is currently hanging: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-selenium/88019/console. Would someone be willing to gather some debug information from the agent, for T389536? [19:42:50] T389536: Selenium timeouts can cause the job to remain stuck until the build times out - https://phabricator.wikimedia.org/T389536 [19:43:36] Uhm actually, maybe it isn't stuck. But still, something is wrong with that job, so debug information would help. Chrome logs in particular. [19:44:44] (And specifically see if we still get the crash observed in https://phabricator.wikimedia.org/T389536#10675707) [19:49:22] 06Release-Engineering-Team, 10Projects-Cleanup, 07Essential-Work: Archive the analytics/gobblin-wmf Gerrit repository - https://phabricator.wikimedia.org/T392854#10778251 (10Amire80) [19:51:40] Also, https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=All seems to be on fire [20:10:24] yes...there are a massive number of ffmpeg processes running on integration-agent-docker-1048 [20:10:36] load average: 232.20, 234.51, 231.36 [20:14:30] I'm filing a task to document it. The obvious culprit would be the core patch that runs 100x selenium [20:15:05] wmf-quibble-selenium-php81 is the job that's running, currently [20:15:27] trying to gather more but the box is...hard to use :) [20:17:02] 10Continuous-Integration-Infrastructure: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963 (10Daimona) 03NEW [20:17:02] Task filed: T392963 [20:17:03] T392963: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963 [20:17:07] thcipriani: got root. want me to killall ffmpeg? [20:17:20] Now I can kill the offending job with a task reference and feel better ;) [20:18:53] ah, wait, looks like it was running two jobs, probably for the same patch, one for mediawiki-quibble-selenium-vendor-mysql-php74 as well [20:19:28] Killed both jobs, let's see. [20:19:58] ffmpeg processes still running..so far [20:20:16] well this'd do it :D https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721790/25/package.json [20:21:06] Normally that'd be fine. But that patch is in conjunction with a wdio version bump which I think breaks the ffmpeg termination logic. [20:23:10] 10Continuous-Integration-Infrastructure: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778330 (10Daimona) From `#wikimedia-releng`: ` yes...there are a massive number of ffmpeg processes running on integration-agent-docker-1048 That didn't do much huh? [20:24:24] it sure is busy and swapping. but doesnt have a disk space issue and I can use the shell. [20:24:59] Are there still ffmpeg processes running? [20:25:14] yes. they are still there. kill? [20:25:23] could probably kill the docker container [20:25:41] If there are too many of them, yeah, I'd say kill. [20:26:42] ps aux | grep ffmpeg | wc -l [20:26:42] 59 [20:26:48] killall -9 ffmpeg [20:26:48] root@integration-agent-docker-1048:/var/log# ps aux | grep ffmpeg | wc -l [20:26:51] 1 [20:27:12] Great, thanks. 
[20:27:20] -9 shouldn't be necessary but it was [20:27:41] I don't understand though: why is a single agent bringing everything down? Isn't there supposed to be any safeguard? [20:28:19] !log integration-agent-docker-1048.integration - killall -9 ffpmeg - T392963 [20:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:28:26] T392963: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963 [20:29:50] that grafana board you linked earlier.. load going down. but still shows that other VMs are down [20:29:54] Also... I'm getting a 403 for https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1048/builds [20:30:25] -1062 - reported as down [20:30:40] -1063 - down [20:31:00] I seem to recall a jenkins feature that we disabled and responds with 403, which was discussed recently somewhere. Is this it? [20:31:40] aha, we now have 28 instances up. just a minute ago it was only 25 up [20:32:17] They seem to be recovering, yes. [20:35:58] https://integration.wikimedia.org/ci/computer/ - 0 of 3 executors busy now [20:35:59] 06Release-Engineering-Team, 06serviceops: train presync failed - https://phabricator.wikimedia.org/T387823#10778370 (10akosiaris) Change to allow #release-engineering-team members to start train-presync, train-clean and view logs has been merged and deployed. [20:37:09] Daimona: would it make sense to click rebuild on your original selenium job now? [20:37:16] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-selenium/88019/console ? [20:38:35] The patch is already in gate-and-submit, so waiting to be merged... Sooner or later. Maybe later... [20:39:06] ok [20:39:34] I've aborted https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php74/25384/console to help unblock the queue [20:39:41] Other patches for that change were already failing. [20:39:42] urbanecm: I was just checking the current @trusted list here to see the config. I was setting up things in a new #wikimedia-zuul channel and trying to remember what was common. [20:41:08] I should've said: I clicked the thingy to abort the job. But it doesn't seem to be responding. [20:42:02] Alright, it suddenly did. This confirms that calling jerkins out in IRC is surprisingly effective at unblocking stuff. [20:43:03] ;) [20:43:43] It never hurts to tell Jerkins to behave :) [20:43:58] is it weird why this new one failed ? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1139932 [20:44:19] FAILURE No change detected against the current configuration. [20:44:30] change looks like it does change configuration [20:46:43] The file name is the same, so that's probably interpreted as unchanged config? [20:47:14] if there was an IRC command to tell Jenkins to behave it should be something like !Leeeeroy [20:47:40] nod, was just curious and to check if CI works as normal now [20:47:57] This looks problematic though https://integration.wikimedia.org/ci/job/mwext-php74-phan/92825/console [20:48:21] Phan was killed due to low memory, but this is from just a few minutes ago. [20:49:45] agent 1051 still struggling it seems https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-agent-docker-1051&from=now-24h&to=now [20:50:26] But reported as idle https://integration.wikimedia.org/ci/computer/integration-agent-docker-1051/ [20:50:37] Alright, how many ffmpeg processes running there? 
[20:51:38] (Also, LOL for the !Leeeeeroy) [20:57:27] Also apparently low on memory: 1040, 1041, 1044, 1047, 1051, 1062, 1063, 1064 [20:58:08] And by "low" I mean that the line in the graph is touching the X axis [20:59:47] May be worth checking them to see if there's a suspicious amount of ffmpeg processes running. There should never ever be more than 10-15 on a single agent at once (and that's already a worst-case scenario). [21:03:26] back. logging in on 1051 ..fails [21:03:50] Nice! [21:03:51] my bad. got shell. [21:03:58] yes, lots of ffmpeg [21:04:16] killing them [21:04:51] !log integration-agent-docker-1051.integration - killall -9 ffmpeg - T392963 [21:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [21:04:53] T392963: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963 [21:05:24] The incantation seems to have worked: https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-agent-docker-1051&from=now-15m&to=now&viewPanel=40 [21:07:04] load average: 33.52, 136.06, 189.46 [21:08:34] 10Continuous-Integration-Infrastructure: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778428 (10Daimona) FTR, this is being discussed in IRC: https://wm-bot.wmcloud.org/browser/index.php?start=04%2F29%2F2025&end=04%2F29%2F2025&display=%23wikim... [21:08:35] looks like something restarted it [21:08:51] how many? [21:09:05] only 24 [21:10:21] its running ffmpeg but seems like that's about half [21:10:47] But... The agent is idle https://integration.wikimedia.org/ci/computer/integration-agent-docker-1051/ [21:11:36] Maybe killall again? Meanwhile I'm aborting jobs where there's already a failure [21:11:50] ok [21:12:13] done. down to 1. [21:12:23] ehm. 0 :) [21:13:03] the top process is now git [21:13:13] npm ci [21:13:31] Makes sense, now it's running a real job [21:14:02] yea, looks like normal, no ffmpeg.. instead php, lua and whatnot [21:14:11] So I'm assuming the other agents I listed above are also flooded by ffmpeg? [21:14:26] And those would be: 1040, 1041, 1044, 1047, 1062, 1063, 1064 [21:14:54] checking [21:16:45] 1040 ☑️ [21:17:21] Yep, it went stonks https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-agent-docker-1040&viewPanel=4&from=now-5m&to=now [21:18:00] 1041 ☑️ [21:18:35] 10Phabricator (Upstream), 07Upstream: Modified files not counted in total when attaching files - https://phabricator.wikimedia.org/T380361#10778444 (10valerio.bozzolan) @Mahabarata73 thanks again for your report. Can I ask how have you discovered this problem? Are you a translator? (I think yes) Or, have you... 
[21:19:05] 1044 ☑️ [21:19:37] on each of them: yes, ffmpeg, and killall more than once [21:19:50] they come back though to some extent [21:19:53] Sigh [21:20:05] Some of those agents are running actual jobs so that's expected [21:20:22] (03open) 10dancy: spiderpig: Send HTTP access log to syslog if use_syslog enabled [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/780 [21:20:25] (03update) 10dancy: spiderpig: Send HTTP access log to syslog if use_syslog enabled [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/780 [21:20:45] 1047 - not busy, no action [21:21:04] But even if an agent happens to be running 3 selenium jobs (3 being our current max concurrency), and each of those has parallel selenium enabled with 4 threads, there should never be more than 12 ffmpeg processes at any given time [21:21:25] ok, good to know 12 is the number [21:22:12] Sorry, 1047 was fine. It's 1048 that seems busy again [21:22:25] 1062 - was very busy. killed ffmpeg. now 5 processes [21:22:30] It seems to be choking slowly https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-agent-docker-1048&from=now-1h&to=now&viewPanel=40 [21:23:21] 1048 - number of ffmpeg proces = 48. killing [21:24:01] now 4 procs [21:26:01] 1063 - 55 ffmpeg procs - killall'ed [21:28:23] 1064 - 52 ffmpeg procs - killall'ed [21:28:35] that's it? last one was the slowest to even get on [21:28:58] (03update) 10dancy: spiderpig: Send HTTP access log to syslog if use_syslog enabled [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/780 [21:29:26] Thanks! All instances are up now. [21:29:33] great [21:29:33] Let me go through them one by one again [21:31:00] Is 1044 ok? ~50% available memory but reportedly idle [21:31:28] 10Continuous-Integration-Infrastructure: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778467 (10Dzahn) ` 21:14 < Daimona> So I'm assuming the other agents I listed above are also flooded by ffmpeg? 21:14 < Daimona> And those would be: 1040, 10... [21:31:31] (03update) 10dancy: spiderpig: Send HTTP access log to syslog if use_syslog enabled [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/780 [21:32:24] 1051 is quite low on memory too, but currently running stuff so will come back to it later. [21:34:17] Checked all of them and the rest is fine. I would double-check 1044 and 1051 for ffmpeg processes [21:34:24] Daimona: 1051 - also has 50 ffmpeg [21:34:29] ... [21:34:53] Maximum allowed given current jobs is 0, and I think 50 > 0 [21:35:08] 1044 - 17 ffmpegs [21:35:17] killed on 1051, left 1044 alone [21:35:28] Maximum allowed for 1044 is also 0 given current jobs [21:35:53] killed on 1044 as well [21:36:26] Thank you! Will keep checking the graphs for both, just in case they go stonks again [21:37:48] I think we should set a timeout when spawning those ffmpeg jobs anyway. I'll do that. [21:38:48] (03merge) 10dancy: spiderpig: ensure each interaction is notified only once [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/778 (https://phabricator.wikimedia.org/T392487) (owner: 10jnuche) [21:44:18] sounds good. 
ty, out for lunch now [21:49:57] (03open) 10volker-e: releases: Bump Codex to 2.0.0-rc.1 [repos/ci-tools/libup-config] - 10https://gitlab.wikimedia.org/repos/ci-tools/libup-config/-/merge_requests/73 (https://phabricator.wikimedia.org/T391012) [21:54:48] Leaked picture of agent-1051 right now: https://phabricator.wikimedia.org/F59561871 [21:55:27] 1044 not looking too healthy either [22:00:32] lol @ meme. cleaned up! but at this point it seems like it will come back anyways? [22:01:31] I imagine these could be from the selenium test retries, so it should stop eventually [22:01:40] As we only allow 1 retry for each test [22:02:08] selenium is the gift that just keeps on giving [22:02:11] ok [22:02:20] it really is [22:03:06] Ahem, what's going on in 1040? [22:04:21] had 43 processes. not anymore [22:04:49] 1062 also... [22:05:01] and 1064 [22:05:14] they do keep coming back it seems... [22:06:31] old-man-yells-at-ffmpeg.jpg [22:06:38] yes, it does. always the same issue again [22:08:06] I did those 2 as well but yea... [22:08:46] 1048 also not looking well [22:08:56] Does ffmpeg run outside of a Docker container for these tests? Trying to reason about where the processes would leak and how we could clean them up. [22:09:05] There must be a nicer way to do this right? [22:09:19] yes, it is not inside a container [22:09:43] they are just ffmpeg processes run by user nobody [22:10:25] ffmpeg -f x11grab -video_size 1280x1024 -i :94 -loglevel error -y -pix_fmt yuv420p /workspace/log/API-Missing-Page-should-not-exist-2025-04-29T21-49-01-513Z.mp4 [22:10:56] That is https://gerrit.wikimedia.org/g/mediawiki/core/+/d3090254b0e8b2284b100d77e32c18155df75f0a/tests/selenium/wdio-mediawiki/index.js#65 [22:12:59] stopVideo is ffmpeg.kill( 'SIGINT' ); .. [22:13:21] maybe I should send signal 2 (SIGINT) to properly stop them then [22:14:19] do you think they come back because they know they failed to complete the command [22:16:28] Maybe? I thought it was due to test retries, but on second thought, that is not possible. [22:17:15] on instance 1064 - tried it, sent a SIGINT (2). killall -2. this makes them stop but not abruptly all at once [22:18:02] this reminds me of a classic quote [22:18:12] "Generally, send 15, and wait a second or two, and if that doesn't work, send 2, and if that doesn't work, send 1. If that doesn't, REMOVE THE BINARY because the program is badly behaved! " Don't use kill -9. Don't bring out the combine harvester just to tidy up the flower pot. " [22:20:43] watching the number of processes. it just crossed the 12 threshold :/ [22:21:37] Eeeeeew [22:22:39] Is there a node process also? [22:22:48] should I see the video thing under https://integration.wikimedia.org/ci/view/Selenium/ ? [22:23:13] yes, multiple /usr/bin/node [22:23:49] sh -c for i in $(seq 1 100); do wdio ./tests/selenium/wdio.conf.js; done [22:23:50] It should be under the artifacts from the build, like in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11735/artifact/log/ [22:23:57] /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js [22:24:13] Okay yeah, that would be it [22:24:40] there are like 6 of those node processes and 2 of those shell loops [22:24:42] if it keeps spawning selenium tests... 
[22:24:55] those shell loops you can safely kill everywhere you see them [22:25:03] maintenance-disconnect-full-disks build 697360 integration-agent-docker-1062 (/: 26%, /srv: 100%, /var/lib/docker: 51%): OFFLINE due to disk space [22:25:19] just the loops or also the node though [22:25:35] I guess everything that relates to "wdio.conf" [22:25:38] as for the node, if it's from the loop, it can be killed [22:26:13] we can check one agent at a time to see if there are any legit node processes [22:26:48] ok, looks like killing the shell command and waiting a bit is also enough for it to all go away [22:26:51] another option is to just kill everything, but CI still needs to catch up and it might be preferable not to make that worse [22:26:54] this happened on 1064 now [22:27:25] Okay great [22:27:27] yea, "ps aux | grep wdio" [22:27:39] empty [22:29:34] 1062: similar but only a single shell loop, not 2, and fewer nodes [22:29:45] Alright, so, the agents with suspiciously low memory right now are: 1040, 1044, 1048 [22:30:04] maintenance-disconnect-full-disks build 697361 integration-agent-docker-1062 (/: 26%, /srv: 94%, /var/lib/docker: 51%): RECOVERY disk space OK [22:32:16] 1050 also worth double checking maybe [22:32:35] sorry disappeared into meeting and then down a rabbit hole, catching up [22:33:00] 1062 and 1040 - check for changes [22:33:04] should be better [22:33:23] thanks for the cleanup mutante <3 [22:33:32] 1062 recovered, 1040 recovering [22:33:52] thcipriani: :) yw. I am now killing shell loops like this: [22:34:01] sh -c for i in $(seq 1 100); do wdio ./tests/selenium/wdio.conf.js [22:34:15] and any node process that uses wdio.conf.js [22:34:23] can't you docker kill the running container? [22:34:49] they are outside a container [22:35:32] hrm, that is a mismatch for my memory of how this worked [22:35:46] although my memory has been known to become outdated quickly [22:36:45] Just did a pass of killing some redundant jobs. Some failure on agent-1062 due to full disk [22:37:04] thcipriani: can we click somewhere on or near https://integration.wikimedia.org/ci/view/Selenium/ to disable that entire "wdio" test? [22:37:18] /srv full of garbage apparently [22:37:25] it's definetly that "wdio" conf [22:38:06] one of the maintenance jobs should recover 1062 if /srv is full up [22:38:17] if / gets full it's a manual thingy [22:38:34] 1044 and 1048 - also cleaned up [22:38:35] runs some docker cleanup and brings it back online [22:38:42] Yep I see it's improving. I guess it went "oh btw here are the 500 ffmpeg video captures you asked for" [22:39:11] thcipriani: so it's basically "ps aux | grep wdio" to see it all at once and if it's gone [22:39:26] most instances had 1 of those "for 1 in 100" shell loops [22:39:33] and like 4 to 6 node processes [22:39:48] but at least one had 2 of the loops at the same time [22:39:50] so are we having to kill all selenium jobs at the moment, is that what's happening? [22:40:14] afaict not all selenium jobs. just the one that creates videos [22:40:57] well in theory all of them create videos. 
But the ones with the shell loop use wdio 8 which seems broken [22:41:14] thcipriani: https://gerrit.wikimedia.org/g/mediawiki/core/+/d3090254b0e8b2284b100d77e32c18155df75f0a/tests/selenium/wdio-mediawiki/index.js#65 [22:42:29] on 1048 it already came back again [22:43:11] that's a legit job [22:44:12] the "for in in $(seq 1 100)" in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721790/25/package.json that you linked to earlier [22:44:40] So, everything with the shell loop is evil and can be killed on sight [22:44:56] what if that would just be reverted ? [22:45:09] Things without the loop can be legit jobs. But they might also use wdio v8 which is evil. I don't think you can tell them apart by looking at just the command [22:45:10] wait, that's a WIP change [22:45:33] Correct. It's never been merged. But somehow it outlived the container [22:46:27] maybe that was the case on 1048 just now and it finished legit jobs [22:46:41] because ffmpeg and the node went away without me doing anything this time [22:47:09] I'm also killing some jobs so yeah [22:47:59] One way to do this could be to check the agents one by one and see if their node jobs are legit. But surely we can do better? [22:48:06] s/jobs/processes/ [22:48:24] there is a cumin instance in integration...or there was [22:48:41] as you said earlier. check number of ffmpeg processes and if it's over 12 then bad, otherwise leave alone [22:48:43] we could also write some groovy in jenkins [22:49:19] the giant load spike is looking much better: https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=All [22:49:38] even within the same agent there might be a mix of good and evil [22:50:01] and we dont know why it started now and not before? [22:50:03] maintenance-disconnect-full-disks build 697365 integration-agent-docker-1062 (/: 26%, /srv: 99%, /var/lib/docker: 47%): OFFLINE due to disk space [22:50:09] thcipriani: I see a integration-cumin.integration.eqiad1.wikimedia.cloud instance [22:50:48] * bd808 sinks back into the bushes [22:51:00] we cant just disable the one test that uses that "wdio 8"? [22:51:01] according to cumin "O{project:integration}" "ps aux | grep ffmpeg | wc -l" there is no host over 12 [22:51:23] great! [22:51:47] if it stays like that.. it's because on some instances the shell loop had not been killed. only ffmpeg itself [22:51:56] well..except integration-cumin but that was due to the grep :D [22:51:57] and then later all of it [22:52:41] 12 is the absolute max though. It's still possible that there are evil processes somewhere. [22:52:57] cumin "O{project:integration}" "ps aux | grep [f]fmpeg | wc -l" show 8 hosts with 1 and 24 hosts with 0 [22:53:02] ;) a bunch of the numbers I mentioned you can also subtract 1 because I didnt bother to | grep -v grep [22:53:44] Could I have a list of `wdio` processes across all agents? [22:53:50] thcipriani: maybe let's count any process that has string "wdio" in it [22:53:52] hah [22:53:53] So I can cross-reference it with the current jobs [22:54:30] looks like integration-agent-docker-1052.integration.eqiad1.wikimedia.cloud is the only host running wdio afaict [22:55:10] That's legit [22:55:11] eh.. 
it seems I had never connected to 1052 [22:55:32] that has one of the shell loops and a single node [22:55:40] but not 2 and 8 or more ..and no ffmpeg [22:55:46] so seems more legit indeed [22:56:09] It should be https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/11748/console [22:57:16] legit wdio: [22:57:18] node /workspace/src/extensions/ProofreadPage/node_modules/.bin/wdio tests/selenium/wdio.conf.js [22:57:26] /usr/bin/node --no-wasm-code-gc /workspace/src/extensions/ProofreadPage/node_modules/@wdio/local-runner/build/run.js tests/selenium/wdio.conf.js [22:57:38] bad wdio from earlier: [22:57:40] /usr/bin/node /workspace/src/node_modules/@wdio/local-runner/build/run.js ./tests/selenium/wdio.conf.js [22:57:49] if that makes sense [22:57:55] load average looking good. looks like mischief managed. Just got to make sure that we run down whatever is going on with wdio 8 [22:58:41] and probably don't run it in a loop until we do :) [22:58:48] ✅ [22:58:57] I'm not sure if the difference in the invocation is significant, but at any rate, we should be good [22:59:14] why does it run in a loop when the change adding a loop is not merged [22:59:34] Loops should be ok per se. I too have done it many times. It doesn't cause harm as long as everything is working correctly... Which doesn't be the case with wdio 8. [23:00:03] maintenance-disconnect-full-disks build 697367 integration-agent-docker-1062 (/: 26%, /srv: 71%, /var/lib/docker: 46%): RECOVERY disk space OK [23:00:50] But on the other hand, I imagine that the patch in question really was trying to figure out what's wrong with wdio 8. [23:01:41] yeah, probably not anticipating it would eat all CI resources for some reason [23:01:56] seems like a legit way to find flakiness [23:02:41] It's been really useful in the past. It surely didn't have these side effects with wdio 7 [23:03:54] random guess: `afterTest` is no longer run? [23:04:03] so "stopVideo" never gets called [23:04:33] so it was starting an ffmpeg process for every test and never stopping it [23:06:24] That is my understanding. I think I saw something along those lines in the log. But why did it outlive the container? [23:06:34] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778630 (10Daimona) >>! In T392963#10778428, @Daimona wrote: > - Is the theory in T392963#10778330 correct? Based on what we know now:... [23:07:21] (03update) 10bd808: SpiderPig: auto select first backport search match [repos/releng/scap] - 10https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/731 (https://phabricator.wikimedia.org/T392508) [23:07:57] could be that we killed the parent process and there was nothing in the container to reap the child processes? Dunno [23:10:12] Yeah no idea. But I think it used to work fine with wdio v7 [23:14:37] At any rate, I left it as a "to figure out" in the task. The grafana dashboard looks much better now, thanks mutante for destroying all the ffmpeg crap. Now I'll disappear :) [23:14:50] ^ thanks both [23:17:12] oh: https://integration.wikimedia.org/ci/job/mediawiki-quibble-selenium-vendor-mysql-php74/25118/console so it looks like chrome was crashing over and over in the afterTest hook, probably causing the stopVideo afterTest hook to never be executed...for some reason. Anyway, I'll dump that theory in the task. [23:24:08] the "stopVideo" code says it is sending SIGINT (signal 2). 
on one host I used that (killall -2 ffmpeg), which made the processes stop but one by one and a bit more proper. as opposed to just hard kill -9 on others. [23:27:11] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778640 (10Dzahn) We saw ffmpeg processes come back after being killed.. then figure out there were shell loops (sh -c for i in ...) as... [23:30:40] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778646 (10Dzahn) Where ffmpeg gets spawned: https://gerrit.wikimedia.org/g/mediawiki/core/+/d3090254b0e8b2284b100d77e32c18155df75f0a/t... [23:33:30] 10GitLab (Pipeline Services Migration🐤), 06collaboration-services, 13Patch-For-Review: Move micro sites from Ganeti to Kubernetes and from Gerrit to GitLab - https://phabricator.wikimedia.org/T300171#10778647 (10Dzahn) I removed the static-rt site from the legacy miscweb VMs. Now os-reports (T350794) is the... [23:34:03] 10GitLab (Pipeline Services Migration🐤), 06collaboration-services, 13Patch-For-Review: Move micro sites from Ganeti to Kubernetes and from Gerrit to GitLab - https://phabricator.wikimedia.org/T300171#10778650 (10Dzahn) [23:40:20] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778662 (10thcipriani) >>! In T392963#10778640, @Dzahn wrote: > We saw ffmpeg processes come back after being killed.. then figure out t... [23:44:43] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: CI is overwhelmed and lots of jobs are failing randomly (2025-04-29) - https://phabricator.wikimedia.org/T392963#10778676 (10thcipriani) 05Open→03Resolved a:03Dzahn I added a comment on the task that spawned this issue that should point fol...
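A closing note on the leaked recorders: the real code is the `stopVideo` logic in `tests/selenium/wdio-mediawiki/index.js`, which is JavaScript, so the snippet below is only a language-agnostic sketch (in Python) of the two mitigations discussed above: cap the recording length when spawning ffmpeg, so a missed `afterTest` hook cannot leave it running for the lifetime of the agent, and on teardown escalate from SIGINT to a hard kill if the process does not exit on its own.

```python
# Illustrative only; not the wdio-mediawiki implementation.
import signal
import subprocess

def start_recorder(display, outfile, max_seconds=900):
    # '-t' makes ffmpeg stop on its own even if nobody ever stops it.
    return subprocess.Popen([
        'ffmpeg', '-f', 'x11grab', '-video_size', '1280x1024',
        '-i', display, '-t', str(max_seconds),
        '-loglevel', 'error', '-y', '-pix_fmt', 'yuv420p', outfile,
    ])

def stop_recorder(proc, grace=10):
    proc.send_signal(signal.SIGINT)   # let ffmpeg finalize the mp4
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()                   # the equivalent of the manual killall -9
        proc.wait()
```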