[00:53:08] (PS1) 4nn1l2: Add 4nn1l2 to the CI whitelist [integration/config] - https://gerrit.wikimedia.org/r/741985
[01:12:31] Phabricator, Release-Engineering-Team, serviceops: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Hawkeye7) HTTPS is for web pages. I am dealing with software, not web pages. I thought Wikimedia had moved off gerrit and onto gitlab. (https://news...
[01:38:44] Phabricator, Release-Engineering-Team, serviceops: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Reedy) >>! In T296022#7530265, @Hawkeye7 wrote: > HTTPS is for web pages. I am dealing with software, not web pages. It really doesn't matter. git o...
[03:36:18] Phabricator, Release-Engineering-Team, serviceops: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Hawkeye7) It seems that I do have a gitlab.wikimedia.org account, so the project could be moved there, although it says that it is still under constru...
[06:20:25] Project beta-update-databases-eqiad build #54853: FAILURE in 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54853/
[06:23:00] /usr/local/bin/mwscript: line 26: 13477 Segmentation fault ???
[07:07:39] !log hard reboot deployment-mwmaint02
[07:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[07:09:14] Project beta-scap-sync-world build #28427: FAILURE in 24 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28427/
[07:13:16] Yippee, build fixed!
[07:13:16] Project beta-scap-sync-world build #28428: FIXED in 2 min 36 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/28428/
[07:19:00] good morning
[07:20:26] Project beta-update-databases-eqiad build #54854: STILL FAILING in 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54854/
[07:40:01] "
[07:40:37] who maintains/owns the devtools cloud vps project? I'd like to test a puppet patch there
[08:11:25] bonjour hashar
[08:15:46] Phabricator, Release-Engineering-Team, serviceops: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Majavah) >>! In T296022#7530284, @Hawkeye7 wrote: > I gave it my Phabricator userid and password, but got an error Which password were you using? You...
[08:18:30] Continuous-Integration-Config, Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh)
[08:19:29] Continuous-Integration-Config, Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh) @zeljkofilipin any idea what might be going on here? I don't think we changed anything in our t...
[08:20:12] Continuous-Integration-Config, Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh) p:Triage→High
[08:20:23] Project beta-update-databases-eqiad build #54855: STILL FAILING in 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54855/
[08:49:31] kostajh: hi )
[08:50:51] majavah: devtools got setup by Daniel Zahn and others to let us setup our development tooling suite (phabricator / gerrit etc)
[08:51:12] kostajh: wdio is dieing ? :(\
[08:51:52] hashar: yeah, i haven't seen this particular error before.
[08:52:14] I don't think the CI image got rebuild any recently
[08:52:27] so I guess it would be related to a new nodejs module being released
[08:52:53] hashar: can I take that as "you can test things there as long as you fix everything you break"?
[08:54:15] I want to test the doc.wm.o migration sync patch
[08:54:35] majavah: sounds correct yes
[08:54:42] hashar: looks like the site can't be reached? https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/124732/artifact/log/Homepage-Shows-a-suggested-edits-card-and-allows-navigation-forwards-and-backwards-through-queue-2021-11-26T07-06-52-470Z.mp4
[08:54:52] for doc.wm.o I cant look at it. Gotta dig into the exact flow being used
[08:55:02] maybe there is no ssh host key verification involved at all
[08:55:24] kostajh: which would implies php -S has an issue of some sort :\
[08:55:37] usually I pick the last good and first bad builds and compare the console output
[08:55:55] no idea though why php would close the socket though
[08:56:02] time to switch to apache :P
[08:56:03] maybe it timesout
[09:20:22] Project beta-update-databases-eqiad build #54856: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54856/
[09:33:08] meeting done
[09:45:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/741713/ works as expected on devtools
[09:53:48] Continuous-Integration-Config, Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar) I have compared the last two builds of that job for the change https://gerrit.wikimedia.org/r/c/...
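The "last good vs. first bad build" comparison mentioned at 08:55:37 could be sketched roughly as below. This is only an illustration, not the tooling used in the channel: it assumes each console line starts with an HH:MM:SS timestamp (as in the "07:05:57 ..." excerpt quoted on the task), and the function names are hypothetical.

```python
import difflib
import re

# Assumption: console lines are prefixed with an HH:MM:SS timestamp.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\s?")


def strip_timestamps(lines):
    """Drop the leading timestamp so identical steps compare as equal."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def compare_console_logs(good_lines, bad_lines):
    """Return unified-diff lines between the last good and first bad build."""
    good = strip_timestamps(good_lines)
    bad = strip_timestamps(bad_lines)
    return list(
        difflib.unified_diff(
            good, bad, fromfile="last-good", tofile="first-bad", lineterm="", n=0
        )
    )
```

Stripping the timestamps first keeps the diff focused on what the builds did rather than when they did it.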
[09:53:50] kostajh: I compared the last good build vs the failling build and I can't find anything :-\
[10:06:50] Continuous-Integration-Config, Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh) One thing of interest is that the `quibble-vendor-mysql-php72-selenium-docker` job which runs G...
[10:07:56] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, ci-test-error: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar) ` 07:05:57 Execution of 2 spec files started at 2021-11-26T07:05:57.049Z 07:05:57 07:05:57 [0-0] RUNNING in chr...
[10:08:00] kostajh: i have found something related to lot of MessageCache::loadFromDB queries
[10:08:24] the php web server did handle the request properly but I guess it took more than 30 seconds to reply
[10:08:33] so wdio considered it stall and aborts
[10:16:10] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, ci-test-error: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar) For the good build https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/124603/artifact/lo...
[10:16:12] I don't have much to say beside what I wrote :-\
[10:20:22] Project beta-update-databases-eqiad build #54857: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54857/
[10:23:17] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, ci-test-error: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar) The first debug line that differs is the bad debug log has a call to `MediaWiki\User\ActorStore::getUserIdentity...
[10:26:34] oh
[10:26:42] and we have a segmentation fault in update.php
[10:26:48] will file that as a task
[10:35:57] Continuous-Integration-Config: Add PHP 8.1 for PHP extensions CI - https://phabricator.wikimedia.org/T293509 (Jdforrester-WMF)
[10:35:59] Continuous-Integration-Config, PHP 8.1 support: Create experimental PHP 8.1 images - https://phabricator.wikimedia.org/T296489 (Jdforrester-WMF)
[11:20:35] Project beta-update-databases-eqiad build #54858: STILL FAILING in 35 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54858/
[12:20:22] Project beta-update-databases-eqiad build #54859: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54859/
[12:29:30] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 2 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (zeljkofilipin)
[12:36:36] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 2 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (dom_walden) I am seeing errors in the beta logs of the form: ` 2021-11-26 11:42:20 [YaDIDYZogQBJCA7hy0XS4QAAAAU] deployment-m...
[13:20:22] Project beta-update-databases-eqiad build #54860: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54860/
[13:39:56] segfault? nice
[13:41:30] yeah, I have no clue what's going on
[14:12:22] I forgot to file a task about it bah
[14:15:23] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar)
[14:15:55] !log deployment-prep: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ database updating job is broken since 6:20 UTC due to a segmentation fault | T296539
[14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:15:58] T296539: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539
[14:20:22] Project beta-update-databases-eqiad build #54861: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54861/
[14:23:07] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar) When things break in this way, my usual suspect is unattended upgrade of Debian package which happens at 6:15 UTC via a cron job. On the instance deployment-de...
[14:24:17] the only things I see is that libvips-tools and tidy got removed earlier in the week
[14:24:29] and this morning at 6:15 bunch of packages related to those got removed as well
[14:24:45] (cause of unattended upgrades which kick in on a daily based and does apt get upgrade or something
[14:25:03] maybe php or one of our php extensions actually depends on those lib
[14:25:08] and that would end up causing the seg fault
[14:26:15] but that does not seem the case based on a ldd on all of the extensions .so
[14:26:19] so who knows ...
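The ldd check described at 14:26:15 (which came back clean in this case) could be automated along these lines. This is a hedged sketch, not what was actually run: it assumes the usual `name => not found` form that ldd prints for an unresolved shared library, and both function names and the sample library name in the usage are hypothetical.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return names of shared libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_extension(path):
    """Run ldd on one extension .so and list its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling `check_extension` on each extension `.so` and looking for a non-empty result would show whether a removed package left a dangling dependency.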
[14:30:11] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (Majavah) Can reproduce without the wmf-beta-update-databases wrapper: ` taavi@deployment-deploy01:~$ mwscript update.php --wiki aawiki --quick #!/usr/bin/env php Warnin...
[14:31:38] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (Majavah) cc @legoktm for the vips findings above, no unattended-updates on production but letting you know just in case
[14:37:04] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar) Running update.php from the command line, the script stalls after the line: ` name=deployment-deploy01.deployment-prep.eqiad.wmflabs $ mwscript update.php --wik...
[14:38:48] reboot the machine? :P
[14:39:48] unlikely ;D
[14:39:57] I got some backtrace using gdb
[14:40:02] who knows really :-\
[14:40:08] I am afraid that will have to wait for monday
[14:41:08] Reedy: at least two machines with the same problem :/
[14:41:19] nice
[14:41:32] please add any findings you have on the task ;)
[14:41:35] deploy01 (stretch) and deploy03 (buster)
[14:41:40] I did add everything I found so far
[14:41:46] maybe the php packages have some issue
[14:45:41] It wouldn't be a complete surprised
[15:20:25] Project beta-update-databases-eqiad build #54862: STILL FAILING in 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54862/
[15:52:30] week-end time &
[15:53:57] Why has update-databases been segfaulting all day
[15:54:21] T296539
[15:54:21] 05:20 last success
[15:54:21] T296539: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539
[15:55:29] Ah
[16:20:22] Project beta-update-databases-eqiad build #54863: STILL FAILING in 22 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54863/
[16:23:59] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh) >>! In T296508#7530794, @dom_walden wrote: > I am seeing errors in the beta logs of the form: > ` > 2021-11-26 11:42...
[16:27:17] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (kostajh) >>! In T296508#7530570, @hashar wrote: > The first debug line that differs is the bad debug log has a call to `Media...
[16:40:55] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (thiemowmde) I don't know if this is related, but the comments on [the patch that broke my local dev environment](https://gerr...
[16:51:44] Release-Engineering-Team (Next), Release, Train Deployments: 1.38.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T293952 (hashar)
[16:51:46] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar)
[16:52:54] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (hashar) > Original exception: [8f5bf9e7e6dc4746f4a1bbb9] /index.php/Main_Page Error: Maximum function nesting level of '256'...
[16:54:00] Release-Engineering-Team (Next), Release, Train Deployments: 1.38.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T293952 (hashar)
[16:54:02] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar)
[16:54:49] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar) p:Triage→Unbreak! That might be related to the infinite loop listed at T296508#7531413 I have marked it a train blocker thus it is now unbreak now prio...
[16:58:08] Release-Engineering-Team, Growth-Team, GrowthExperiments, Browser-Tests, and 3 others: ERROR @wdio/sync: Error: socket hang up - https://phabricator.wikimedia.org/T296508 (Pchelolo) Yup, reverting the patch mentioned by @thiemowmde will certainly fix this. I'll re-do the patch slightly differentl...
[17:02:13] Beta-Cluster-Infrastructure: deployment-prep automatic update.php fails with Segmentation Fault - https://phabricator.wikimedia.org/T296539 (hashar) @Majavah sorry I have missed your gdb comment cause I was editing the task at the same time you have send your message. The trick I found for the www-data acco...
[17:10:56] RhinosF1: majavah: Reedy: looks like the root cause of the segfault has been found :)
[17:11:10] Yep
[17:11:10] let's see if the revert affects anything
[17:11:10] something about infinite loop while looking up message
[17:11:17] which also affect the issue kostajh had earlier today
[17:11:19] \o/
[17:11:26] I am now really off for the week-end!
[17:11:55] thanks for debugging hashar! have a nice weekend
[17:20:23] Project beta-update-databases-eqiad build #54864: STILL FAILING in 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54864/
[17:20:37] computer says no?
[17:21:19] revert is still in jerkins
[17:50:36] Yippee, build fixed!
[17:50:36] Project beta-update-databases-eqiad build #54865: FIXED in 1 min 35 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/54865/
[20:19:10] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:40] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook