[00:00:31] <brennen>	 hrm: rsync: change_dir "/mediawiki-core/wmf-1.38.0-wmf.19/wmf-quibble-core-vendor-mysql-php72-docker" (in caches) failed: No such file or directory (2)
[00:01:28] <James_F>	 brennen: In a CI job? That's just the regular castor failure isn't it?
[00:01:38] <James_F>	 There are so many of them I've become blind to them. :-(
[00:01:54] <hashar>	 other builds fail with some other reasons 
[00:01:55] <brennen>	 ah, yeah, probably
[00:02:11] <hashar>	 like  src/.git  disappearing right when git clone is actually clonign which makes no sense at all
[00:02:22] <thcipriani>	 may be a red herring, integration-agent-docker-1033 (bullseye instance) seems to complain of locale problems trying to set LC_ALL=en_US.UTF-8 (in my .profile, I guess)
[00:02:24] <hashar>	 or MessageEn.php vanishing in the middle of running phpunit
[00:02:46] <thcipriani>	 or symptom of the same problem
[00:03:03] <James_F>	 E.g. on https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71013/consoleFull `rsync: failed to set times on "/cache/.": Operation not permitted (1)`
[00:03:24] <hashar>	 that is rsync that can set time
[00:03:27] <James_F>	 Is the new disk system unstable in some fashion?
[00:03:32] <hashar>	 cause the cache belong to a different user iirc
[00:03:34] <hashar>	 not an issue
[00:03:44] <James_F>	 They get unmounted during runtime or something?
[00:03:50] <hashar>	 no idea 
[00:04:02] <brennen>	 find(1) giving a cannot delete sure feels like the filesystem changing out from under it
[00:04:13] <hashar>	 beside the OS ugprade , we are now usng instances with ephmereal disks
[00:04:34] <James_F>	 Yeah.
[00:04:43] <hashar>	 while previously we had a single disk for os + build ( aka disk80 flavor )
[00:04:46] <James_F>	 Which is novel just for us, I believe? No other users yet in WMCS.
[00:04:59] <hashar>	 and I don't know what it changes on the WMCS backend side. Maybe they are in different Ceph pools with different caches / latency or whatever
[00:05:27] <James_F>	 We could change the deletion instruction to delete-if-it-can or something, but that is just plastering over the issue (and won't fix .git going away mid-clone).
[00:05:32] <hashar>	 the other big change is upgrading to Docker 20.10.5
[00:05:39] <hashar>	 (previously 18.09.5 iirc)
[00:05:41] <James_F>	 Yeah.
[00:05:51] <James_F>	 hashar: How hard would it be to roll back?
[00:05:58] <hashar>	 so I don't think it is related to castor
[00:06:06] <hashar>	 it is at a lower level
[00:06:16] <hashar>	 the disk / partition has files magically vanishing
[00:06:30] <thcipriani>	 docker doesn't seem likely to be the problem, either
[00:06:36] <hashar>	 oh
[00:06:38] <James_F>	 Agreed.
[00:06:44] <hashar>	 dont hold your breath on docker not being a problem!!! :D
[00:06:55] <James_F>	 I've been running docker 20 for months without issues.
[00:06:56] <thcipriani>	 experimental ephemeral disks sound like a good thing to investigate
[00:07:07] <James_F>	 Should we ask in -cloud?
[00:07:17] <bd808>	 andrewbogott: you about? 
[00:07:20] <hashar>	 but yeah ephemereal might be one
[00:07:21] <dancy>	 I don't believe in files disappearing without a corresponding unlink() system call (or corresponding filesystem problems logged to dmesg)
[00:07:22] <thcipriani>	 are all our instances using that? how can you tell?
[00:07:33] <James_F>	 Oh, or lovely WMCS cloud-people can just lurk here and be awesome.
[00:07:48] <hashar>	 it is also pretty random after the few builds I have watched. There is no clear pattern
[00:08:01] <bd808>	 thcipriani: https://openstack-browser.toolforge.org/project/integration
[00:08:27] <hashar>	 the integration-agent-docker-XXXX instances are on Bullseye with a disk20 + ephmereal 60
[00:08:32] <bd808>	 that's the list of instances, and it looks like hashar built most of them today
[00:09:18] <hashar>	 the ephmereal 60G disk has a LVM system on it with 24G for /var/lib/docker and 36G for /srv  and those partition data can dropped without any problem (just have to stop docker before nuking the partition)
[00:09:50] <hashar>	 yup I migrated the instances today after we got the new g3 flavor
[00:11:07] <thcipriani>	 ok, so if disk is backed by ceph, and we've been running them all day, but just now started having issues, and we suspect ceph backed file system: how do we monitor ceph? Are there unusual latencies there?
[00:11:28] <James_F>	 Or were the issues happening earlier and we just didn't notice?
[00:12:07] <thcipriani>	 (or maybe we haven't been running the instances all day...I guess i notice that 1039 is from 4 hours ago)
[00:12:23] <James_F>	 Yeah.
[00:12:25] <bd808>	 ceph generally looks happy -- https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1
[00:12:29] <brennen>	 _feels_ unlikely that this was happening terribly long before we noticed it
[00:13:04] <brennen>	 we only caught it during a time with fairly low typical activity
[00:13:11] <dancy>	 If Linux has a problem with a filesystem, it logs messages to dmesg and usually remounts the fs read-only.  None of that is happening.
[00:13:14] <thcipriani>	 that ceph dashboard is incredibly upbeat
[00:14:01] <James_F>	 It's not just working, it's UP.
[00:14:34] <hashar>	 I find unlikely for Ceph to magically loose files
[00:14:40] <brennen>	 dancy: yeah, my first thought was is there anything weird in dmesg, and...  well, i don't think so anyway
[00:14:43] <hashar>	 that would surely cause problems at various otuerhs places
[00:15:28] * dancy whispers "concurrency issues"
[00:16:06] <thcipriani>	 we could lower the number of executors per machine as an experiment if that's worth trying, but what's the thinking?
[00:16:21] <hashar>	 I am checking the agents config
[00:16:23] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31414: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31414/
[00:16:35] <hashar>	 there might be agents in Jenkins actually connected to the same instance
[00:17:01] <hashar>	 the integration-agent-docker-1038 has no docker images 
[00:17:40] <thcipriani>	 hrm, same with 1033, but the load was like 22 when I logged in
[00:17:53] <thcipriani>	 oh, no, nevermind
[00:17:54] <hashar>	                                                                                                                         
[00:17:54] <hashar>	 ===== NODE GROUP =====                                                                                                                                                                        
[00:17:54] <hashar>	 (1) integration-agent-docker-1033.integration.eqiad1.wikimedia.cloud                                                                                                                          
[00:17:54] <hashar>	 ----- OUTPUT of 'pidof java' -----                                                                                                                                                            
[00:17:54] <hashar>	 37884 37874 37873 37872 2006                                              
[00:18:17] <hashar>	 which is the wining instance
[00:18:29] <hashar>	 it is attached multiple time to the jenkins master
[00:18:33] <thcipriani>	 it's got images, I just added an extranous argument to docker images and it happily reported no images
[00:18:37] <dancy>	 huh
[00:18:37] <bd808>	 Are diffs like this in the agent config "normal"? -- https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41
[00:18:38] <hashar>	 andrewbogott: bd808: it is not ceph nor wmcs
[00:18:59] <hashar>	 oh the hell it is possible I don't know
[00:19:32] <hashar>	 so in https://integration.wikimedia.org/ci/computer/  gotta check the IP of each computer
[00:19:42] <hashar>	 and find out where they are attached
[00:21:10] <thcipriani>	 bd808: I guess that was a typo? the old config is an instance named -1026, the ip it has now is for -1027
[00:21:26] <hashar>	 hmm
[00:21:38] <hashar>	 so 1038 instance has no java process
[00:21:48] <hashar>	 the IP on the instance is 172.16.7.110
[00:22:07] <hashar>	 but the agent log at https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1038/log shows it is connected to a different IP: 172.16.2.224
[00:22:23] <hashar>	 which is the IP of agent-docker-1033 ..
[00:22:38] <thcipriani>	 well, 1033 does have 5 java processes running on it
[00:22:43] <hashar>	 assuming the config is good for all agent
[00:22:45] <bd808>	 my internal doubt was if the config change happened before or after the instance "attached" to jerkins. Like if things are cross wired somehow
[00:22:50] <thcipriani>	 all running "remoting.jar"
[00:22:52] <hashar>	 I think the easiest is to kill jenkins
[00:22:59] <hashar>	 and on restart it would reconnect to the proper machine
[00:23:09] <hashar>	 so:  ssh contint2001.wikimedia.org sudo systemctl restart jenkins
[00:24:09] <thcipriani>	 I can do that if it's too late for you
[00:24:14] <thcipriani>	 not going to hurt anything to do that
[00:24:33] <thcipriani>	 !log restarting jenkins
[00:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[00:24:50] <andrewbogott>	 I'm out for a walk but reading back scroll
[00:25:05] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) It is neither Ceph nor Docker. Somehow multiple agents in Jenkins are connected to the same instance `integrati...
[00:25:06] <thcipriani>	 done
[00:25:16] <hashar>	 so that disconnect all the agents
[00:25:28] <thcipriani>	 down to 0 java pids on 1033
[00:25:28] * James_F Let's hope they come back on the right connections.
[00:25:32] <hashar>	 kill all the builds (good cause the builds test stuff that is more or les undeterminated)
[00:25:40] <andrewbogott>	 Sounds like it's not a cloud-level thing but lmk if you find otherwise
[00:26:08] <thcipriani>	 thanks andrewbogott I don't think we suspect it's cloud level at the moment, but we're still troubleshooting
[00:26:13] <hashar>	 andrewbogott: yes it is definitely a mis configuration. Multiple jobs running on the same directory so if one delete files the other jobs see files have vanished :@]
[00:26:39] <thcipriani>	 ah ha
[00:26:42] <dancy>	 `DELETE from important_table;`
[00:26:43] <hashar>	 hashar@integration-cumin:~$ sudo cumin --force 'name:docker' 'pidof java'
[00:26:45] <hashar>	 looks ok
[00:26:56] <hashar>	 so it is solved
[00:27:06] <hashar>	 as to how that ended up like that .... I have zero ideas
[00:27:17] <thcipriani>	 now I understand what concurrency issue meant :D
[00:27:19] <dancy>	 hashar: Is it a detectable situation (read: icinga check)
[00:27:21] <hashar>	 but bd808 has a point with https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41
[00:27:30] <brennen>	 turning it off and on again strikes once more
[00:27:37] <hashar>	 dancy: I don't think we have much monitoring available on wmcs 
[00:27:59] * bd808 points to the IT Crowd boxed set on his shelf and nods at brennen 
[00:28:17] <thcipriani>	 we could detect the situation of multiple agents on a single host
[00:28:22] <dancy>	 hashar: Icinga or otherwise...  is it detectable
[00:28:28] <thcipriani>	 which I'd guess shouldn't happen
[00:28:56] <hashar>	 but that config diff shows that the agent integration-agent-docker-1027  has been created with the IP 172.16.6.180 which is actually the agent 1026
[00:29:09] <dancy>	 btw, why don't we use hostnames instead of IPS?
[00:29:18] <dancy>	 Something I wondered about from early on
[00:29:30] <hashar>	 so that was a valid case for "have you tried rebooting?"
[00:29:31] <dancy>	 especially when trying to figure out what hostname to log into when debugging.
[00:30:04] <hashar>	 we use IP cause the Jenkins primary is on a production system and the DNS resolver there does not resolve cloud DNS entries (unless something has changed)
[00:30:16] <dancy>	 ah
[00:30:44] <dancy>	 in exchange, we get this
[00:31:12] <hashar>	 contint2001:~$ dig +short integration-agent-docker-1027.integration.eqiad1.wikimedia.cloud
[00:31:13] <hashar>	 172.16.6.196
[00:31:17] <hashar>	 hey
[00:31:18] <dancy>	 I guess a similar problem could be caused by supplying the same hostname for two different agents.
[00:31:23] <bd808>	 There is a prometheus monitoring stack that tools and deployment-prep are hooked up to inside WMCS. I don't think the integration project is setup with it today.
[00:31:38] <hashar>	 I don't even know to want the glory details that makes the DNS resolution possible from the host
[00:31:51] <thcipriani>	 (aside from technical details: thanks for jumping in on troubleshooting this folks. <3 I thought for sure since it's 1am for hashar we'd be on our own...:D)
[00:31:52] <hashar>	 I just take it as a blessing that indeed we can do dancy suggestion: switch to fqdn
[00:33:05] <James_F>	 That might be simpler long-term.
[00:33:20] * James_F But definitely not a fix for hashar at 01:30. :-)
[00:33:32] <brennen>	 yeah, appreciate all the help, and sorry hashar :(
[00:33:41] <dancy>	 hashar!! +100!
[00:34:01] <James_F>	 OK, do we have an idea of if the reboot fixed things?
[00:34:30] <James_F>	 Single java thread on each agent == {{done}}?
[00:35:08] <bd808>	 https://integration.wikimedia.org/zuul/ has more green things than before at least
[00:35:17] <thcipriani>	 having multiple agents run the same job on each host sure would explain what we were seeing
[00:35:23] <James_F>	 Yes.
[00:35:45] <brennen>	 yeah, much more handily than Weird Filesystem Magic
[00:35:53] <brennen>	 i also had no idea that could happen
[00:36:07] <dancy>	 It can if you configure it to 
[00:36:27] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) The agent configuration were pointing to their WMCS instances IP. https://integration.wikimedia.org/ci/computer...
[00:36:29] <James_F>	 Computers are silly and will do exactly what you told them to, even when they shouldn't.
[00:36:35] <dancy>	 dummies!
[00:37:15] <wikibugs>	 10Continuous-Integration-Infrastructure: Switch CI agent config from IPs to FQDN now it's possible - https://phabricator.wikimedia.org/T300224 (10Jdforrester-WMF)
[00:37:25] <bd808>	 I think there is a separate puppet config issue that is causing failures like https://integration.wikimedia.org/ci/job/publish-to-doc/6703/console (rsync not in PATH on integration-agent-docker-1024.integration.eqiad1.wikimedia.cloud.)
[00:37:28] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) p:05High→03Medium Solved by team work over IRC `#wikimedia-releng` with bunch of nice suggestions, leads an...
[00:37:29] <James_F>	 There, we have a follow-up task. It's all professional and everything.
[00:37:34] <hashar>	 I wrote the summary
[00:37:57] <hashar>	 I think bd808 found the actual root cause which is that the agent have the IP changed
[00:38:15] <hashar>	 cause I have created them by copying the conf from another agent,  update the IP then save
[00:38:26] <hashar>	 but I guess as soon as I create the agent Jenkins immediately connect
[00:38:45] <hashar>	 and when the IP is changed and the agent is actually saved, Jenkins has already connected on the IP of the other agent
[00:38:58] <hashar>	 which would be the rationale 2am explanation
[00:39:20] <dancy>	 hashar: Mad respect for your dedication
[00:39:26] <hashar>	 yeah well
[00:39:32] <hashar>	 I am really too tired
[00:39:33] <bd808>	 post-mortems are best done after sleep :)
[00:39:41] <brennen>	 go to bed hashar. :)
[00:39:42] <James_F>	 OK, is it time to try to land the wmf.19 patch? (And let's switch back to -operations.)
[00:39:45] <hashar>	  Idon't even understand how I managed to figure out all those bits or write all those sentences
[00:39:51] <dancy>	 oh, we call them "outros" now
[00:39:59] <thcipriani>	 ^
[00:40:04] <dancy>	 post-mortem is too triggering
[00:40:05] <thcipriani>	 I don't like the term postmortem
[00:40:17] <hashar>	 so the most likely explanation is that when build the Stretch instances 2.5 years ago the exact same issue must have happened
[00:40:19] <thcipriani>	 or after-action ... whatever
[00:40:25] <James_F>	 pre-vivation?
[00:40:33] <dancy>	 Makes one think of an autopsy and Dr Baden
[00:40:34] <thcipriani>	 I picked the term disquisition
[00:40:44] <hashar>	 and my brain memory just got me on auto pilot redoing the same debug steps
[00:41:01] <thcipriani>	 https://www.merriam-webster.com/dictionary/disquisition
[00:41:06] <hashar>	 :D
[00:41:18] <James_F>	 Fancy.
[00:41:31] <hashar>	 thx for all the support / debug etc!  if one can drop a quick note to wikitech-l to explain the failures that would be very kind
[00:42:04] <zabe>	 thx for the quick mitigation :)
[00:42:17] <hashar>	 :-]
[00:42:42] <hashar>	 and I really appreciate how many folks step in when something explodes. That is always great to watch
[00:43:18] <hashar>	 80% sure there is a task from 2.5 years ago about it
[00:43:43] <hashar>	 there is no chance I can have done all that debugging above at this time of the day
[00:43:56] <hashar>	 so I must have done the exact same debugging step a while ago
[00:44:01] <hashar>	 but can't remember about it cause I am tired
[00:44:04] <hashar>	 pff
[00:44:15] * hashar heads to bed  
[00:44:31] <bd808>	 I think hashar is in denial about how the deep structures of his brain have been warped by running the CI stack for N>9 years
[00:44:45] <hashar>	 yeah that as well :D
[00:45:30] <hashar>	 will do post mortem tomorrow I guess
[00:45:55] <hashar>	 🌊
[00:46:03] <thcipriani>	 goodnight
[00:46:17] * James_F thcipriani: BTW, I saw you made train tasks out to wmf.28, so does that mean you expect the REL1_38 branch to be cut at the end of March not the 15th of March?
[00:46:42] * James_F (What, me? Phab-stalking things? Never!)
[00:46:44] <thcipriani>	 :D
[00:47:14] <thcipriani>	 James_F: it seems you've thought harder than me about the train tasks :)
[00:47:18] * James_F grins.
[00:47:45] <James_F>	 Only that "two releases a year" means 52/2 = 26 weekly alpha cuts. 
[00:47:50] <James_F>	 High-end maths, that is.
[00:48:21] <thcipriani>	 my task creation should not be viewed as a change to that policy
[00:48:27] <James_F>	 Ack.
[00:49:26] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10bd808)
[00:49:28] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10bd808)
[00:50:26] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10bd808)
[00:50:49] * James_F thcipriani: I can't edit those magic tasks to point them at 1.39.x unfortunately.
[00:51:10] <James_F>	 Have created https://www.mediawiki.org/wiki/MediaWiki_1.39/Roadmap for transparency.
[01:13:26] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10bd808)
[01:18:16] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31415: 04STILL FAILING in 13 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31415/
[01:24:25] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10bd808) This might be caused by the Bullseye base image being more bare bones than prior Debian base ima...
[02:15:01] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31416: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31416/
[03:10:27] <wmf-insecte>	 Project mwcore-phpunit-coverage-master build #1915: 04FAILURE in 10 min: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/1915/
[03:15:36] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31417: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31417/
[03:53:11] <wmf-insecte>	 Project mediawiki-core-phpmetrics-docker build #1160: 04FAILURE in 4 min 10 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-phpmetrics-docker/1160/
[04:16:00] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31418: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31418/
[05:15:21] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31419: 04STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31419/
[06:14:47] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31420: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31420/
[07:11:20] <wikibugs>	 (03PS8) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411
[07:14:41] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31421: 04STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31421/
[07:15:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan)
[08:00:46] <wmf-insecte>	 Project Wikibase-phpmetrics-docker build #479: 04FAILURE in 45 sec: https://integration.wikimedia.org/ci/job/Wikibase-phpmetrics-docker/479/
[08:16:16] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31422: 04STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31422/
[08:20:19] <wikibugs>	 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10Ladsgroup)
[08:35:11] <hashar>	 pff
[08:38:25] <hashar>	 bash: line 1: rsync: command not found
[08:38:27] <hashar>	 fun
[08:40:49] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar)
[08:40:58] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) p:05Triage→03Unbreak!
[08:41:45] <wikibugs>	 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 (10JMeybohm) `scap pull` and restbase dummy deploy seemed fine.
[08:42:18] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) ` integration-cumin:~$ sudo cumin --force 'name:docker' 'which rsync' 18 hosts will be targeted: integration-agent-docker-[1...
[08:57:22] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) From the host that works: ` # aptitude why rsync i   git-fat Depends rsync `  However `git-fat` is no more available on Bull...
[09:05:20] <hashar>	 !log integration: cumin --force 'name:docker' 'apt install rsync'  # T300214
[09:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:05:21] <stashbot>	 T300214: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214
[09:07:34] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) 05Open→03Resolved a:03hashar
[09:07:37] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10hashar)
[09:16:28] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10User-brennen: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214 (10hashar) 05Resolved→03Open
[09:16:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye - https://phabricator.wikimedia.org/T252071 (10hashar)
[09:16:54] <hashar>	 !log integration: cumin --force 'name:docker' 'apt install rsync'  # T300236
[09:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[09:16:56] <stashbot>	 T300236: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236
[09:17:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) 05Open→03Resolved
[09:19:19] <wmf-insecte>	 Yippee, build fixed!
[09:19:20] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31423: 09FIXED in 15 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31423/
[09:23:20] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10hashar) Sorry I have missed this task this morning. While reading the IRC backscroll I have noticed `me...
[09:23:43] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar)
[09:23:47] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Seen): "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225 (10hashar)
[09:26:29] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: publishing to doc.wikimedia.org fails: rsync command not found - https://phabricator.wikimedia.org/T300236 (10hashar) From https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/buildTimeTrend  {F34932576 size=full}  The build...
[09:56:29] <wikibugs>	 10Release-Engineering-Team: Should UI regressions (e.g. no fatals) with the non-default skins ever block the train? - https://phabricator.wikimedia.org/T300169 (10TheDJ) As far as I can tell, "not-supported" has always been a statement to reduce cognitive load on developers when developing new features and not s...
[10:07:43] <wikibugs>	 (03PS9) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411
[10:11:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan)
[10:27:36] <wikibugs>	 10Release-Engineering-Team: Should UI regressions (e.g. no fatals) with the non-default skins ever block the train? - https://phabricator.wikimedia.org/T300169 (10hashar) The status quo as I understand it is that Vector is the prime skin and the others ones are support on a best effort basis (we would fix things...
[10:35:02] <wikibugs>	 (03PS1) 10Kosta Harlan: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015)
[10:37:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan)
[10:39:42] <wikibugs>	 (03PS2) 10Kosta Harlan: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015)
[11:21:51] <kostajh>	 hi hashar ^ should be ready to deploy, if you have some time. It's possible the GrowthExperiments tests will start failing though, so please lmk when you are able to deploy that so I can be around to fix things as needed
[11:34:58] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Puppet failures on integration-agent-qemu hosts - https://phabricator.wikimedia.org/T299836 (10Majavah)
[11:35:07] <wikibugs>	 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Puppet failure on integration-agent-qemu* - https://phabricator.wikimedia.org/T299996 (10Majavah)
[11:38:57] <taavi>	 hashar: "I don't even know to want the glory details that makes the DNS resolution possible from the host" simply just that we now use a publicly registered domain (wikimedia.cloud), it even works on your laptop
[11:43:07] <taavi>	 bd808: integration is also hooked to the prometheus stack, but there aren't any special alerting rules (just standards like puppet failures and instances being down) and the alerts only go to #wikimedia-cloud-feed
[13:06:20] <wikibugs>	 (03PS10) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411
[13:11:15] <wikibugs>	 (03PS1) 10QChris: Allow “Gerrit Managers” to import history [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757654
[13:11:17] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757654 (owner: 10QChris)
[13:11:19] <wikibugs>	 (03PS1) 10QChris: Import done. Revoke import grants [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757655
[13:11:21] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [extensions/CategoryExplorer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/757655 (owner: 10QChris)
[13:37:14] <wikibugs>	 (03CR) 10Awight: [C: 03+1] "Looks like it should give a small speed gain: the three steps might even be complementary in using cpu, network, and disk, but unfortunate" [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan)
[13:39:30] <wikibugs>	 (03CR) 10Kosta Harlan: [WIP] Run post-dependency install, pre-test steps in parallel (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (owner: 10Kosta Harlan)
[14:26:02] <wikibugs>	 (03PS11) 10Kosta Harlan: Run post-dependency install, pre-test steps in parallel [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (https://phabricator.wikimedia.org/T225730)
[14:44:27] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan)
[14:45:46] <hashar>	 kostajh: I am deploying the patch  to trigger apitests on GrowthExperiments
[14:46:18] <wikibugs>	 (03Merged) 10jenkins-bot: layout: Enable apitests for GrowthExperiments [integration/config] - 10https://gerrit.wikimedia.org/r/757625 (https://phabricator.wikimedia.org/T253015) (owner: 10Kosta Harlan)
[14:56:31] <kostajh>	 ok, thanks hashar 
[14:56:44] <hashar>	 kostajh: it is deployed :]
[14:57:23] <kostajh>	 cheers!
[14:57:25] * kostajh watches https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php72-docker/26841/console
[15:04:18] <kostajh>	 hashar: it's semi-broken, but let me see if I can fix that quickly without undeploying the jjb patch
[15:06:51] <hashar>	 eek
[15:07:06] <hashar>	 might have been better to deploy the CI change first then recheck the patch on the extension to confirm it works
[15:07:18] <hashar>	 hopefully it is easy to fix?
[15:10:55] <wmf-insecte>	 Yippee, build fixed!
[15:10:55] <wmf-insecte>	 Project mwcore-phpunit-coverage-master build #1916: 09FIXED in 10 min: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/1916/
[15:11:51] <kostajh>	 hashar: yeah I think it will be
[15:12:05] <kostajh>	 (famous last words)
[15:14:05] <addshore>	 kostajh: do you midn if I remove the large copyright things from mwcli?
[15:14:09] <addshore>	 *mind
[15:14:52] <addshore>	 I think it will make it a little easier to read some of the text and docs explaining things in the files :)
[15:15:14] <kostajh>	 addshore: fine by me
[15:16:21] <addshore>	 cool!
[15:37:36] <wikibugs>	 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.2.1 - https://phabricator.wikimedia.org/T300058 (10dancy)
[15:37:45] <wikibugs>	 10Release-Engineering-Team, 10Scap: scap overrides for deploy-local using -D parameter fail - https://phabricator.wikimedia.org/T300177 (10dancy) 05In progress→03Resolved This problem should be fixed now.
[15:37:54] <dancy>	 tzags.
[15:42:45] <wikibugs>	 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10cscott) @Krinkle found a missed case for the ParserOutput::addModules() deprecation mentioned above, with a patch in http...
[15:57:25] <wikibugs>	 (03CR) 10Cwhite: "This change is ready for review." [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite)
[16:00:29] <hashar>	 !log Pooling back agents 1035 1036 1037 1038 , they could not connect due to ssh host mismatch since yesterday they all got attached to instance 1033 and accepted that host key # T300214
[16:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:00:31] <stashbot>	 T300214: 'No such file or directory' CI failures in multiple repos - https://phabricator.wikimedia.org/T300214
[16:40:46] <Lucas_WMDE>	 strange build behavior: https://integration.wikimedia.org/ci/job/mwgate-node12-docker/74935/console has apparently made no progress since 16:24:38 UTC (ca. 15 mins ago)
[16:45:51] <dancy>	 It finally failed.
[16:46:09] <dancy>	 hmm. aborted.. perhaps that was you
[16:46:13] <Lucas_WMDE>	 not me
[16:46:21] <Lucas_WMDE>	 ah, Node OOM
[16:46:33] <Lucas_WMDE>	 I’ve seen those a few times but not usually in this repository I think
[16:46:44] <Lucas_WMDE>	 and I hadn’t noticed before that they have such a long window of no output before them
[16:49:49] <Lucas_WMDE>	 interesting, I think I’m also getting increasing memory usage when I run those tests locally
[16:50:21] <Lucas_WMDE>	 though in a slightly different place in the tests
[16:54:07] <Lucas_WMDE>	 I abandoned the corresponding change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/726557 for now
[16:54:59] <dancy>	 Alright. Good luck!
[16:58:15] <Lucas_WMDE>	 thanks!
[17:45:54] <wikibugs>	 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10Urbanecm)
[17:48:10] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693
[17:49:05] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693 (owner: 10Ahmon Dancy)
[17:49:45] <wikibugs>	 (03Merged) 10jenkins-bot: scap sync-world: Make wikiversions-compile phase verbose [tools/scap] - 10https://gerrit.wikimedia.org/r/757693 (owner: 10Ahmon Dancy)
[18:57:46] <wikibugs>	 10Release-Engineering-Team (Next), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 (10brennen)
[19:13:20] <wikibugs>	 (03CR) 10Ladsgroup: "It doesn't have a ticket nor explains why in the commit message. Is it not needed? https://phabricator.wikimedia.org/macro/view/6/" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:14:50] <wikibugs>	 (03PS2) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency [integration/config] - 10https://gerrit.wikimedia.org/r/757464
[19:15:09] <wikibugs>	 (03CR) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:20:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Awesome. Do you want me to deploy it?" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:22:31] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31433: 04FAILURE in 18 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31433/
[19:27:43] <wikibugs>	 (03CR) 10Majavah: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:31:23] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "deploying" [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:33:46] <wikibugs>	 (03Merged) 10jenkins-bot: Zuul: [mediawiki/extensions/WikimediaIncubator] Drop CentralAuth dependency [integration/config] - 10https://gerrit.wikimedia.org/r/757464 (owner: 10Majavah)
[19:34:40] <Amir1>	 !log Reloading Zuul to deploy 757464
[19:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:40:40] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] logstash-filter-verifier: upgrade logstash to 7.16 [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite)
[19:42:21] <wikibugs>	 (03Merged) 10jenkins-bot: logstash-filter-verifier: upgrade logstash to 7.16 [integration/config] - 10https://gerrit.wikimedia.org/r/755816 (https://phabricator.wikimedia.org/T299431) (owner: 10Cwhite)
[19:45:14] <wikibugs>	 10Release-Engineering-Team, 10User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (10brennen)
[19:46:05] <wikibugs>	 10Release-Engineering-Team, 10Infrastructure-Foundations, 10Puppet, 10User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (10brennen)
[20:00:43] <wikibugs>	 (03PS1) 10Cwhite: logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736
[20:01:04] <wikibugs>	 (03PS2) 10Cwhite: logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736
[20:10:51] <wikibugs>	 10GitLab (CI & Job Runners), 10Security Team AppSec, 10Security-Team, 10Security, 10user-sbassett: Add minimal yaml file linting as part of ci for security ci templates repository - https://phabricator.wikimedia.org/T294596 (10sbassett) A quick blog post that could be helpful to some future work in this...
[20:11:48] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 (owner: 10Cwhite)
[20:13:33] <wikibugs>	 (03Merged) 10jenkins-bot: logstash-filter-verifier: work around ca-certificates-java install error [integration/config] - 10https://gerrit.wikimedia.org/r/757736 (owner: 10Cwhite)
[20:18:53] <wmf-insecte>	 Yippee, build fixed!
[20:18:54] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #31434: 09FIXED in 14 min: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/31434/
[20:26:44] <hashar>	 !log Successfully published image docker-registry.discovery.wmnet/releng/logstash-filter-verifier:0.0.2  # T299431
[20:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:26:46] <stashbot>	 T299431: Upgrade logstash-filter-verifier logstash to 7.16 - https://phabricator.wikimedia.org/T299431
[21:03:00] <wikibugs>	 (03PS1) 10Hashar: jjb: update logstash-filter-verifier [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431)
[21:03:24] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Job updated!" [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431) (owner: 10Hashar)
[21:05:10] <wikibugs>	 (03Merged) 10jenkins-bot: jjb: update logstash-filter-verifier [integration/config] - 10https://gerrit.wikimedia.org/r/757746 (https://phabricator.wikimedia.org/T299431) (owner: 10Hashar)
[21:23:24] <zabe>	 kostajh: hey, could you resubmit https://gerrit.wikimedia.org/r/c/mediawiki/core/+/753556? It was conflicting with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/735718. All conflicts should be fixed now.
[21:24:29] <kostajh>	 Ok will do
[21:25:35] <zabe>	 thx
[21:43:45] <wikibugs>	 (03CR) 10Krinkle: "How does parallel handle output, does it buffer them internally and then flush as whole chunks whenever one of them finishes? Or continous" [integration/quibble] - 10https://gerrit.wikimedia.org/r/757411 (https://phabricator.wikimedia.org/T225730) (owner: 10Kosta Harlan)
[21:55:41] <wikibugs>	 (03PS1) 10Ahmon Dancy: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759
[22:00:40] <wikibugs>	 (03PS2) 10Ahmon Dancy: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759
[22:01:25] <wikibugs>	 (03CR) 10Ahmon Dancy: "I prepared this commit after a discussion with Timo today." [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy)
[22:04:18] <wikibugs>	 (03PS3) 10Krinkle: sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy)
[22:04:39] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] sync-world: Change handling of wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/757759 (owner: 10Ahmon Dancy)