[00:00:31] hrm: rsync: change_dir "/mediawiki-core/wmf-1.38.0-wmf.19/wmf-quibble-core-vendor-mysql-php72-docker" (in caches) failed: No such file or directory (2) [00:01:28] brennen: In a CI job? That's just the regular castor failure isn't it? [00:01:38] There are so many of them I've become blind to them. :-( [00:01:54] other builds fail with some other reasons [00:01:55] ah, yeah, probably [00:02:11] like src/.git disappearing right when git clone is actually cloning which makes no sense at all [00:02:22] may be a red herring, integration-agent-docker-1033 (bullseye instance) seems to complain of locale problems trying to set LC_ALL=en_US.UTF-8 (in my .profile, I guess) [00:02:24] or MessageEn.php vanishing in the middle of running phpunit [00:02:46] or symptom of the same problem [00:03:03] E.g. on https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71013/consoleFull `rsync: failed to set times on "/cache/.": Operation not permitted (1)` [00:03:24] that is rsync that can't set times [00:03:27] Is the new disk system unstable in some fashion? [00:03:32] cause the cache belongs to a different user iirc [00:03:34] not an issue [00:03:44] They get unmounted during runtime or something? [00:03:50] no idea [00:04:02] find(1) giving a cannot delete sure feels like the filesystem changing out from under it [00:04:13] besides the OS upgrade, we are now using instances with ephemeral disks [00:04:34] Yeah. [00:04:43] while previously we had a single disk for os + build ( aka disk80 flavor ) [00:04:46] Which is novel just for us, I believe? No other users yet in WMCS. [00:04:59] and I don't know what it changes on the WMCS backend side. Maybe they are in different Ceph pools with different caches / latency or whatever [00:05:27] We could change the deletion instruction to delete-if-it-can or something, but that is just plastering over the issue (and won't fix .git going away mid-clone). [00:05:32] the other big change is upgrading to Docker 20.10.5 [00:05:39] (previously 18.09.5 iirc) [00:05:41] Yeah. [00:05:51] hashar: How hard would it be to roll back? [00:05:58] so I don't think it is related to castor [00:06:06] it is at a lower level [00:06:16] the disk / partition has files magically vanishing [00:06:30] docker doesn't seem likely to be the problem, either [00:06:36] oh [00:06:38] Agreed. [00:06:44] don't hold your breath on docker not being a problem!!! :D [00:06:55] I've been running docker 20 for months without issues. [00:06:56] experimental ephemeral disks sound like a good thing to investigate [00:07:07] Should we ask in -cloud? [00:07:17] andrewbogott: you about? [00:07:20] but yeah ephemeral might be one [00:07:21] I don't believe in files disappearing without a corresponding unlink() system call (or corresponding filesystem problems logged to dmesg) [00:07:22] are all our instances using that? how can you tell? [00:07:33] Oh, or lovely WMCS cloud-people can just lurk here and be awesome. [00:07:48] it is also pretty random after the few builds I have watched.
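The "failed to set times" warning above is the benign case hashar describes: `rsync -a` implies `--times`, and preserving an arbitrary mtime on a directory requires owning it, so a sync into a cache owned by another user copies fine but logs that noise. A minimal local reproduction sketch (paths and ownership are illustrative, not the actual castor setup):

```bash
# rsync -a implies --times; setting an explicit mtime on a directory you do not
# own fails with EPERM even when the directory is world-writable, so the copy
# succeeds while rsync logs the same warning seen in the CI job console.
mkdir -p /tmp/src /tmp/cache && touch /tmp/src/file
sudo chown root:root /tmp/cache && sudo chmod 777 /tmp/cache
rsync -a /tmp/src/ /tmp/cache/
# -> rsync: failed to set times on "/tmp/cache/.": Operation not permitted (1)
ls /tmp/cache/    # the file was copied regardless
```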
There is no clear pattern [00:08:01] thcipriani: https://openstack-browser.toolforge.org/project/integration [00:08:27] the integration-agent-docker-XXXX instances are on Bullseye with a disk20 + ephemeral 60 [00:08:32] that's the list of instances, and it looks like hashar built most of them today [00:09:18] the ephemeral 60G disk has an LVM system on it with 24G for /var/lib/docker and 36G for /srv, and those partitions' data can be dropped without any problem (just have to stop docker before nuking the partition) [00:09:50] yup I migrated the instances today after we got the new g3 flavor [00:11:07] ok, so if disk is backed by ceph, and we've been running them all day, but just now started having issues, and we suspect ceph backed file system: how do we monitor ceph? Are there unusual latencies there? [00:11:28] Or were the issues happening earlier and we just didn't notice? [00:12:07] (or maybe we haven't been running the instances all day...I guess I notice that 1039 is from 4 hours ago) [00:12:23] Yeah. [00:12:25] ceph generally looks happy -- https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1 [00:12:29] _feels_ unlikely that this was happening terribly long before we noticed it [00:13:04] we only caught it during a time with fairly low typical activity [00:13:11] If Linux has a problem with a filesystem, it logs messages to dmesg and usually remounts the fs read-only. None of that is happening. [00:13:14] that ceph dashboard is incredibly upbeat [00:14:01] It's not just working, it's UP. [00:14:34] I find it unlikely for Ceph to magically lose files [00:14:40] dancy: yeah, my first thought was is there anything weird in dmesg, and... well, I don't think so anyway [00:14:43] that would surely cause problems at various other places [00:15:28] * dancy whispers "concurrency issues" [00:16:06] we could lower the number of executors per machine as an experiment if that's worth trying, but what's the thinking? [00:16:21] I am checking the agents config [00:16:23] Project mediawiki-core-doxygen-docker build #31414: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31414/ [00:16:35] there might be agents in Jenkins actually connected to the same instance [00:17:01] the integration-agent-docker-1038 has no docker images [00:17:40] hrm, same with 1033, but the load was like 22 when I logged in [00:17:53] oh, no, nevermind [00:17:54] [00:17:54] ===== NODE GROUP ===== [00:17:54] (1) integration-agent-docker-1033.integration.eqiad1.wikimedia.cloud [00:17:54] ----- OUTPUT of 'pidof java' ----- [00:17:54] 37884 37874 37873 37872 2006 [00:18:17] which is the winning instance [00:18:29] it is attached multiple times to the jenkins master [00:18:33] it's got images, I just added an extraneous argument to docker images and it happily reported no images [00:18:37] huh [00:18:37] Are diffs like this in the agent config "normal"? -- https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41 [00:18:38] andrewbogott: bd808: it is not ceph nor wmcs [00:18:59] oh, how the hell it is possible I don't know [00:19:32] so in https://integration.wikimedia.org/ci/computer/ gotta check the IP of each computer [00:19:42] and find out where they are attached [00:21:10] bd808: I guess that was a typo?
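dancy's observation above is the key negative result: if the kernel had actually detected filesystem corruption it would normally log it and force the filesystem read-only, and neither was seen. A quick triage sketch along those lines (generic commands, not an existing check; the cumin selector is the one used elsewhere in this log):

```bash
# Look for filesystem/IO errors and forced read-only remounts on one agent.
sudo dmesg -T | grep -iE 'ext4|xfs|i/o error|remount' | tail -n 20
grep -E ' ro[, ]' /proc/mounts    # any mount currently read-only?

# Fleet-wide variant from the integration-cumin host.
sudo cumin --force 'name:docker' 'dmesg | grep -c "EXT4-fs error" || true'
```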
the old config is an instance named -1026, the IP it has now is for -1027 [00:21:26] hmm [00:21:38] so the 1038 instance has no java process [00:21:48] the IP on the instance is 172.16.7.110 [00:22:07] but the agent log at https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1038/log shows it is connected to a different IP: 172.16.2.224 [00:22:23] which is the IP of agent-docker-1033 .. [00:22:38] well, 1033 does have 5 java processes running on it [00:22:43] assuming the config is good for all agents [00:22:45] my internal doubt was if the config change happened before or after the instance "attached" to jerkins. Like if things are cross wired somehow [00:22:50] all running "remoting.jar" [00:22:52] I think the easiest is to kill jenkins [00:22:59] and on restart it would reconnect to the proper machine [00:23:09] so: ssh contint2001.wikimedia.org sudo systemctl restart jenkins [00:24:09] I can do that if it's too late for you [00:24:14] not going to hurt anything to do that [00:24:33] !log restarting jenkins [00:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [00:24:50] I'm out for a walk but reading back scroll [00:25:05] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) It is neither Ceph nor Docker. Somehow multiple agents in Jenkins are connected to the same instance `integrati... [00:25:06] done [00:25:16] so that disconnected all the agents [00:25:28] down to 0 java pids on 1033 [00:25:28] * James_F Let's hope they come back on the right connections. [00:25:32] kill all the builds (good, cause the builds test stuff that is more or less indeterminate) [00:25:40] Sounds like it's not a cloud-level thing but lmk if you find otherwise [00:26:08] thanks andrewbogott I don't think we suspect it's cloud level at the moment, but we're still troubleshooting [00:26:13] andrewbogott: yes it is definitely a misconfiguration. Multiple jobs running on the same directory so if one deletes files the other jobs see files have vanished :@] [00:26:39] ah ha [00:26:42] `DELETE from important_table;` [00:26:43] hashar@integration-cumin:~$ sudo cumin --force 'name:docker' 'pidof java' [00:26:45] looks ok [00:26:56] so it is solved [00:27:06] as to how that ended up like that .... I have zero ideas [00:27:17] now I understand what concurrency issue meant :D [00:27:19] hashar: Is it a detectable situation (read: icinga check) [00:27:21] but bd808 has a point with https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41 [00:27:30] turning it off and on again strikes once more [00:27:37] dancy: I don't think we have much monitoring available on wmcs [00:27:59] * bd808 points to the IT Crowd boxed set on his shelf and nods at brennen [00:28:17] we could detect the situation of multiple agents on a single host [00:28:22] hashar: Icinga or otherwise... is it detectable [00:28:28] which I'd guess shouldn't happen [00:28:56] but that config diff shows that the agent integration-agent-docker-1027 has been created with the IP 172.16.6.180 which is actually the agent 1026 [00:29:09] btw, why don't we use hostnames instead of IPs? [00:29:18] Something I wondered about from early on [00:29:30] so that was a valid case for "have you tried rebooting?"
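dancy's "is it detectable" question looks answerable with the same cumin trick used above: a healthy agent instance should be running exactly one agent JVM, so anything more is the cross-wired state. A sketch of such a check (not an existing Icinga or Prometheus alert):

```bash
# Flag any Docker agent instance running more than one Jenkins agent JVM.
sudo cumin --force 'name:docker' \
  'c=$(pidof java | wc -w); [ "$c" -le 1 ] || echo "WARNING: $c java agent processes"'
```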
[00:29:31] especially when trying to figure out what hostname to log into when debugging. [00:30:04] we use IPs cause the Jenkins primary is on a production system and the DNS resolver there does not resolve cloud DNS entries (unless something has changed) [00:30:16] ah [00:30:44] in exchange, we get this [00:31:12] contint2001:~$ dig +short integration-agent-docker-1027.integration.eqiad1.wikimedia.cloud [00:31:13] 172.16.6.196 [00:31:17] hey [00:31:18] I guess a similar problem could be caused by supplying the same hostname for two different agents. [00:31:23] There is a prometheus monitoring stack that tools and deployment-prep are hooked up to inside WMCS. I don't think the integration project is set up with it today. [00:31:38] I don't even want to know the gory details that make the DNS resolution possible from the host [00:31:51] (aside from technical details: thanks for jumping in on troubleshooting this folks. <3 I thought for sure since it's 1am for hashar we'd be on our own...:D) [00:31:52] I just take it as a blessing that indeed we can do dancy's suggestion: switch to fqdn [00:33:05] That might be simpler long-term. [00:33:20] * James_F But definitely not a fix for hashar at 01:30. :-) [00:33:32] yeah, appreciate all the help, and sorry hashar :( [00:33:41] hashar!! +100! [00:34:01] OK, do we have an idea of whether the reboot fixed things? [00:34:30] Single java thread on each agent == {{done}}? [00:35:08] https://integration.wikimedia.org/zuul/ has more green things than before at least [00:35:17] having multiple agents run the same job on each host sure would explain what we were seeing [00:35:23] Yes. [00:35:45] yeah, much more handily than Weird Filesystem Magic [00:35:53] I also had no idea that could happen [00:36:07] It can if you configure it to [00:36:27] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) The agent configuration were pointing to their WMCS instances IP. https://integration.wikimedia.org/ci/computer... [00:36:29] Computers are silly and will do exactly what you told them to, even when they shouldn't. [00:36:35] dummies! [00:37:15] 10Continuous-Integration-Infrastructure: Switch CI agent config from IPs to FQDN now it's possible - https://phabricator.wikimedia.org/T300224 (10Jdforrester-WMF) [00:37:25] I think there is a separate puppet config issue that is causing failures like https://integration.wikimedia.org/ci/job/publish-to-doc/6703/console (rsync not in PATH on integration-agent-docker-1024.integration.eqiad1.wikimedia.cloud.) [00:37:28] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) p:05High→03Medium Solved by team work over IRC `#wikimedia-releng` with bunch of nice suggestions, leads an... [00:37:29] There, we have a follow-up task. It's all professional and everything.
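Since contint2001 can now resolve *.wikimedia.cloud names, the FQDN switch (T300224) can be sanity-checked first by comparing what each agent name resolves to against the address currently configured in Jenkins. A sketch, with the instance number range assumed for illustration:

```bash
# Print name -> address for each Docker agent as seen from the Jenkins primary.
for n in $(seq 1024 1043); do    # range is illustrative
  host="integration-agent-docker-${n}.integration.eqiad1.wikimedia.cloud"
  printf '%s -> %s\n' "${host}" "$(dig +short "${host}")"
done
```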
[00:37:34] I wrote the summary [00:37:57] I think bd808 found the actual root cause which is that the agents had their IP changed [00:38:15] cause I created them by copying the conf from another agent, updating the IP, then saving [00:38:26] but I guess as soon as I create the agent Jenkins immediately connects [00:38:45] and when the IP is changed and the agent is actually saved, Jenkins has already connected on the IP of the other agent [00:38:58] which would be the rational 2am explanation [00:39:20] hashar: Mad respect for your dedication [00:39:26] yeah well [00:39:32] I am really too tired [00:39:33] post-mortems are best done after sleep :) [00:39:41] go to bed hashar. :) [00:39:42] OK, is it time to try to land the wmf.19 patch? (And let's switch back to -operations.) [00:39:45] I don't even understand how I managed to figure out all those bits or write all those sentences [00:39:51] oh, we call them "outros" now [00:39:59] ^ [00:40:04] post-mortem is too triggering [00:40:05] I don't like the term postmortem [00:40:17] so the most likely explanation is that when I built the Stretch instances 2.5 years ago the exact same issue must have happened [00:40:19] or after-action ... whatever [00:40:25] pre-vivation? [00:40:33] Makes one think of an autopsy and Dr Baden [00:40:34] I picked the term disquisition [00:40:44] and my brain memory just got me on autopilot redoing the same debug steps [00:41:01] https://www.merriam-webster.com/dictionary/disquisition [00:41:06] :D [00:41:18] Fancy. [00:41:31] thx for all the support / debug etc! if one can drop a quick note to wikitech-l to explain the failures that would be very kind [00:42:04] thx for the quick mitigation :) [00:42:17] :-] [00:42:42] and I really appreciate how many folks step in when something explodes. That is always great to watch [00:43:18] 80% sure there is a task from 2.5 years ago about it [00:43:43] there is no chance I could have done all that debugging above at this time of the day [00:43:56] so I must have done the exact same debugging steps a while ago [00:44:01] but can't remember about it cause I am tired [00:44:04] pff [00:44:15] * hashar heads to bed [00:44:31] I think hashar is in denial about how the deep structures of his brain have been warped by running the CI stack for N>9 years [00:44:45] yeah that as well :D [00:45:30] will do post mortem tomorrow I guess [00:45:55] 🌊 [00:46:03] goodnight [00:46:17] * James_F thcipriani: BTW, I saw you made train tasks out to wmf.28, so does that mean you expect the REL1_38 branch to be cut at the end of March not the 15th of March? [00:46:42] * James_F (What, me? Phab-stalking things? Never!) [00:46:44] :D [00:47:14] James_F: it seems you've thought harder than me about the train tasks :) [00:47:18] * James_F grins. [00:47:45] Only that "two releases a year" means 52/2 = 26 weekly alpha cuts. [00:47:50] High-end maths, that is. [00:48:21] my task creation should not be viewed as a change to that policy [00:48:27] Ack.
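hashar's copy-the-config-then-edit-the-IP sequence is the race: Jenkins starts connecting the moment the cloned node is saved, before the address is corrected. One way to avoid it (a sketch only; the CLI jar location, credentials, node names and address are assumptions) is to build the node XML with the right host before Jenkins ever sees the node, using the standard `get-node`/`create-node` CLI commands:

```bash
# Clone an existing agent definition, rewrite the host, and only then create
# the node, so Jenkins never connects to the donor agent's address.
JENKINS=https://integration.wikimedia.org/ci/
OLD=integration-agent-docker-1026      # donor agent
NEW=integration-agent-docker-1040      # hypothetical new agent
NEW_ADDR=172.16.7.42                   # hypothetical address (or an FQDN, per T300224)
java -jar jenkins-cli.jar -s "$JENKINS" get-node "$OLD" \
  | sed "s|<host>[^<]*</host>|<host>${NEW_ADDR}</host>|" \
  | java -jar jenkins-cli.jar -s "$JENKINS" create-node "$NEW"
```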
[00:49:26] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10bd808) [00:49:28] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10bd808) [00:50:26] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10bd808) [00:50:49] * James_F thcipriani: I can't edit those magic tasks to point them at 1.39.x unfortunately. [00:51:10] Have created https://www.mediawiki.org/wiki/MediaWiki_1.39/Roadmap for transparency. [01:13:26] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10bd808) [01:18:16] Project mediawiki-core-doxygen-docker build #31415: 04STILL FAILING in 13 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31415/ [01:24:25] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10bd808) This might be caused by the Bullseye base image being more bare bones than prior Debian base ima... [02:15:01] Project mediawiki-core-doxygen-docker build #31416: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31416/ [03:10:27] Project mwcore-phpunit-coverage-master build #1915: 04FAILURE in 10 min: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/1915/ [03:15:36] Project mediawiki-core-doxygen-docker build #31417: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31417/ [03:53:11] Project mediawiki-core-phpmetrics-docker build #1160: 04FAILURE in 4 min 10 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-phpmetrics-docker/1160/ [04:16:00] Project mediawiki-core-doxygen-docker build #31418: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31418/ [05:15:21] Project mediawiki-core-doxygen-docker build #31419: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31419/ [06:14:47] Project mediawiki-core-doxygen-docker build #31420: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31420/ [07:11:20] (03PS8) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 [07:14:41] Project mediawiki-core-doxygen-docker build #31421: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31421/ [07:15:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan) [08:00:46] Project Wikibase-phpmetrics-docker build #479: 04FAILURE in 45 sec: https://integration.wikimedia.org/ci/job/Wikibase-phpmetrics-docker/479/ [08:16:16] Project mediawiki-core-doxygen-docker build #31422: 04STILL FAILING in 12 min: 
https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31422/ [08:20:19] 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10Ladsgroup) [08:35:11] pff [08:38:25] bash: line 1: rsync: command not found [08:38:27] fun [08:40:49] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) [08:40:58] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) p:05Triage→03Unbreak! [08:41:45] 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 (10JMeybohm) `scap pull` and restbase dummy deploy seemed fine. [08:42:18] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) ` integration-cumin:~$ sudo cumin --force 'name:docker' 'which rsync' 18 hosts will be targeted: integration-agent-docker-[1... [08:57:22] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) From the host that works: ` # aptitude why rsync i git-fat Depends rsync ` However `git-fat` is no more available on Bull... [09:05:20] !log integration: cumin --force 'name:docker' 'apt install rsync' # T300214 [09:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:05:21] T300214: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 [09:07:34] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) 05Open→03Resolved a:03hashar [09:07:37] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10hashar) [09:16:28] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) 05Resolved→03Open [09:16:31] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10hashar) [09:16:54] !log integration: cumin --force 'name:docker' 'apt install rsync' # T300236 [09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:16:56] T300236: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 [09:17:31] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) 05Open→03Resolved [09:19:19] Yippee, build fixed! 
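The cumin one-liner above installs rsync by hand; per T300225 the longer-term fix belongs in puppet, since the Bullseye base image simply ships fewer packages and the next reimage would reopen the gap. A quick audit sketch to spot the same hole again:

```bash
# Report any Docker agent that still lacks rsync after a rebuild.
sudo cumin --force 'name:docker' \
  'command -v rsync >/dev/null || echo "rsync missing on $(hostname -f)"'
```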
[09:19:20] Project mediawiki-core-doxygen-docker build #31423: 09FIXED in 15 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31423/ [09:23:20] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10hashar) Sorry I have missed this task this morning. While reading the IRC backscroll I have noticed `me... [09:23:43] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) [09:23:47] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10hashar) [09:26:29] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) From https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/buildTimeTrend {F34932576 size=full} The build... [09:56:29] 10Release-Engineering-Team: Should UI regressions (e.g. no fatals) with the non-default skins ever block the train? - https://phabricator.wikimedia.org/T300169 (10TheDJ) As far as I can tell, "not-supported" has always been a statement to reduce cognitive load on developers when developing new features and not s... [10:07:43] (03PS9) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 [10:11:28] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan) [10:27:36] 10Release-Engineering-Team: Should UI regressions (e.g. no fatals) with the non-default skins ever block the train? - https://phabricator.wikimedia.org/T300169 (10hashar) The status quo as I understand it is that Vector is the prime skin and the others ones are support on a best effort basis (we would fix things... [10:35:02] (03PS1) 10Kosta Harlan: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) [10:37:21] (03CR) 10jerkins-bot: [V: 04-1] layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan) [10:39:42] (03PS2) 10Kosta Harlan: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) [11:21:51] hi hashar ^ should be ready to deploy, if you have some time. 
It's possible the GrowthExperiments tests will start failing though, so please lmk when you are able to deploy that so I can be around to fix things as needed [11:34:58] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Puppet failures on integration-agent-qemu hosts - https://phabricator.wikimedia.org/T299836 (10Majavah) [11:35:07] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Puppet failure on integration-agent-qemu* - https://phabricator.wikimedia.org/T299996 (10Majavah) [11:38:57] hashar: "I don't even know to want the glory details that makes the DNS resolution possible from the host" simply just that we now use a publicly registered domain (wikimedia.cloud), it even works on your laptop [11:43:07] bd808: integration is also hooked to the prometheus stack, but there aren't any special alerting rules (just standards like puppet failures and instances being down) and the alerts only go to #wikimedia-cloud-feed [13:06:20] (03PS10) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 [13:11:15] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757654 [13:11:17] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757654 (owner: 10QChris) [13:11:19] (03PS1) 10QChris: Import done. Revoke import grants [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757655 [13:11:21] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757655 (owner: 10QChris) [13:37:14] (03CR) 10Awight: [C: 03+1] "Looks like it should give a small speed gain: the three steps might even be complementary in using cpu, network, and disk, but unfortunate" [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan) [13:39:30] (03CR) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan) [14:26:02] (03PS11) 10Kosta Harlan: Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (https://phabricator.wikimedia.org/T225730) [14:44:27] (03CR) 10Hashar: [C: 03+2] layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan) [14:45:46] kostajh: I am deploying the patch to trigger apitests on GrowthExperiments [14:46:18] (03Merged) 10jenkins-bot: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan) [14:56:31] ok, thanks hashar [14:56:44] kostajh: it is deployed :] [14:57:23] cheers! [14:57:25] * kostajh watches https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php72-docker/26841/console [15:04:18] hashar: it's semi-broken, but let me see if I can fix that quickly without undeploying the jjb patch [15:06:51] eek [15:07:06] might have been better to deploy the CI change first then recheck the patch on the extension to confirm it works [15:07:18] hopefully it is easy to fix? [15:10:55] Yippee, build fixed! 
[15:10:55] Project mwcore-phpunit-coverage-master build #1916: 09FIXED in 10 min: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/1916/ [15:11:51] hashar: yeah I think it will be [15:12:05] (famous last words) [15:14:05] kostajh: do you midn if I remove the large copyright things from mwcli? [15:14:09] *mind [15:14:52] I think it will make it a little easier to read some of the text and docs explaining things in the files :) [15:15:14] addshore: fine by me [15:16:21] cool! [15:37:36] 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 (10dancy) [15:37:45] 10Release-Engineering-Team, 10Scap: scap overrides for deploy-local using -D parameter fail - https://phabricator.wikimedia.org/T300177 (10dancy) 05In progress→03Resolved This problem should be fixed now. [15:37:54] tzags. [15:42:45] 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10cscott) @Krinkle found a missed case for the ParserOutput::addModules() deprecation mentioned above, with a patch in http... [15:57:25] (03CR) 10Cwhite: "This change is ready for review." [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite) [16:00:29] !log Pooling back agents 1035 1036 1037 1038 , they could not connect due to ssh host mismatch since yesterday they all got attached to instance 1033 and accepted that host key # T300214 [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:00:31] T300214: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 [16:40:46] strange build behavior: https://integration.wikimedia.org/ci/job/mwgate-node12-docker/74935/console has apparently made no progress since 16:24:38 UTC (ca. 15 mins ago) [16:45:51] It finally failed. [16:46:09] hmm. aborted.. perhaps that was you [16:46:13] not me [16:46:21] ah, Node OOM [16:46:33] I’ve seen those a few times but not usually in this repository I think [16:46:44] and I hadn’t noticed before that they have such a long window of no output before them [16:49:49] interesting, I think I’m also getting increasing memory usage when I run those tests locally [16:50:21] though in a slightly different place in the tests [16:54:07] I abandoned the corresponding change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/726557 for now [16:54:59] Alright. Good luck! [16:58:15] thanks! 
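For the mwgate-node12 hang-then-abort above: a long silent stretch followed by a Node out-of-memory abort usually means V8 spent that window in garbage collection before giving up, so a common first diagnostic (an assumption here, not something the job already does) is to rerun the suite with a larger old-space limit and see whether the failure moves or disappears:

```bash
# --max-old-space-size is one of the V8 flags NODE_OPTIONS accepts; the value
# and the test entry point are illustrative.
export NODE_OPTIONS="--max-old-space-size=2048"
npm test
```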
[17:45:54] 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10Urbanecm) [17:48:10] (03PS1) 10Ahmon Dancy: scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693 [17:49:05] (03CR) 10Ahmon Dancy: [C: 03+2] scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693 (owner: 10Ahmon Dancy) [17:49:45] (03Merged) 10jenkins-bot: scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693 (owner: 10Ahmon Dancy) [18:57:46] 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10brennen) [19:13:20] (03CR) 10Ladsgroup: "It doesn't have a ticket nor explains why in the commit message. Is it not needed? https://phabricator.wikimedia.org/macro/view/6/" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:14:50] (03PS2) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency [integration/config] - 10https://gerrit.wikimedia.org/r/757464 [19:15:09] (03CR) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:20:32] (03CR) 10Ladsgroup: [C: 03+1] "Awesome. Do you want me to deploy it?" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:22:31] Project mediawiki-core-doxygen-docker build #31433: 04FAILURE in 18 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31433/ [19:27:43] (03CR) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:31:23] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:33:46] (03Merged) 10jenkins-bot: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah) [19:34:40] !log Reloading Zuul to deploy 757464 [19:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:40:40] (03CR) 10Hashar: [C: 03+2] logstash-filter-verifier: upgrade logstash to 7.16 [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite) [19:42:21] (03Merged) 10jenkins-bot: logstash-filter-verifier: upgrade logstash to 7.16 [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite) [19:45:14] 10Release-Engineering-Team, 10User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (10brennen) [19:46:05] 10Release-Engineering-Team, 10Infrastructure-Foundations, 10Puppet, 10User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (10brennen) [20:00:43] (03PS1) 10Cwhite: logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 [20:01:04] (03PS2) 10Cwhite: logstash-filter-verifier: work 
around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 [20:10:51] 10GitLab (CI & Job Runners), 10Security Team AppSec, 10Security-Team, 10Security, 10user-sbassett: Add minimal yaml file linting as part of ci for security ci templates repository - https://phabricator.wikimedia.org/T294596 (10sbassett) A quick blog post that could be helpful to some future work in this... [20:11:48] (03CR) 10Hashar: [C: 03+2] logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 (owner: 10Cwhite) [20:13:33] (03Merged) 10jenkins-bot: logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 (owner: 10Cwhite) [20:18:53] Yippee, build fixed! [20:18:54] Project mediawiki-core-doxygen-docker build #31434: 09FIXED in 14 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31434/ [20:26:44] !log Successfully published image docker-registry.discovery.wmnet/releng/logstash-filter-verifier:0.0.2 # T299431 [20:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:26:46] T299431: Upgrade logstash-filter-verifier logstash to 7.16 - https://phabricator.wikimedia.org/T299431 [21:03:00] (03PS1) 10Hashar: jjb: update logstash-filter-verifier [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431) [21:03:24] (03CR) 10Hashar: [C: 03+2] "Job updated!" [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431) (owner: 10Hashar) [21:05:10] (03Merged) 10jenkins-bot: jjb: update logstash-filter-verifier [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431) (owner: 10Hashar) [21:23:24] kostajh: hey, could you resubmit https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753556? It was conflicting with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/735718. All conflicts should be fixed now. [21:24:29] Ok will do [21:25:35] thx [21:43:45] (03CR) 10Krinkle: "How does parallel handle output, does it buffer them internally and then flush as whole chunks whenever one of them finishes? Or continous" [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (https://phabricator.wikimedia.org/T225730) (owner: 10Kosta Harlan) [21:55:41] (03PS1) 10Ahmon Dancy: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 [22:00:40] (03PS2) 10Ahmon Dancy: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 [22:01:25] (03CR) 10Ahmon Dancy: "I prepared this commit after a discussion with Timo today." [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy) [22:04:18] (03PS3) 10Krinkle: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy) [22:04:39] (03CR) 10Krinkle: [C: 03+1] sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy)