[00:46:42] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [00:55:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:01:42] (Queue (Jenkins jobs + Zuul functions) alert) resolved: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [06:43:13] 10Release-Engineering-Team, 10Projects-Cleanup: Archive or delete wikidata/query/flink-swift-plugin - https://phabricator.wikimedia.org/T314273 (10hashar) [06:45:20] 10GitLab (Administration, Settings & Policy), 10Phabricator, 10Release-Engineering-Team, 10User-brennen: Create a form for tracking administrative privilege requests for GitLab, Phabricator, Gerrit, etc.? - https://phabricator.wikimedia.org/T314495 (10hashar) For Gerrit requests are filed against #gerrit-p... 
[06:58:50] (03CR) 10Hashar: "Deployed using scap from deployment.eqiad.wmnet in /srv/deployment/integration/docroot" [integration/docroot] - 10https://gerrit.wikimedia.org/r/821250 (https://phabricator.wikimedia.org/T309872) (owner: 10Michael Große) [07:04:17] 10Release-Engineering-Team, 10Projects-Cleanup: Archive or delete wikidata/query/flink-swift-plugin - https://phabricator.wikimedia.org/T314273 (10Aklapper) Please see the checklist template linked from https://phabricator.wikimedia.org/project/profile/2829/ [09:50:07] (03CR) 10Hashar: dockerfiles: update commit-message-validator (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/823193 (https://phabricator.wikimedia.org/T315159) (owner: 10BryanDavis) [09:51:09] 10Continuous-Integration-Config, 10Patch-For-Review, 10User-bd808: Update CI for commit-message-validator 1.0.0 - https://phabricator.wikimedia.org/T315159 (10hashar) a:03bd808 [09:51:14] 10Continuous-Integration-Infrastructure, 10MediaWiki-Core-Tests, 10PHP 8.0 support, 10ci-test-error: CI job mediawiki-quibble-composer-mysql-php80-docker on mediawiki/core gate-and-submit is flaky failing with Segmentation fault - https://phabricator.wikimedia.org/T315167 (10hashar) [09:51:18] 10Continuous-Integration-Config, 10PHP 8.0 support: Make PHP 8.0 voting on MW master - https://phabricator.wikimedia.org/T300463 (10hashar) [09:52:01] jnuche: thank you for the jenkins upgrades :-] [10:00:54] 10Release-Engineering-Team, 10Developer Productivity: Developer productivity: arm64 versions of CI docker images - https://phabricator.wikimedia.org/T315286 (10Aklapper) [10:04:45] 10Continuous-Integration-Infrastructure, 10MediaWiki-Core-Tests, 10PHP 8.0 support, 10ci-test-error: CI job mediawiki-quibble-composer-mysql-php80-docker on mediawiki/core gate-and-submit is flaky failing with Segmentation fault - https://phabricator.wikimedia.org/T315167 (10hashar) [10:10:29] 10Continuous-Integration-Infrastructure, 10MediaWiki-Core-Tests, 10PHP 8.0 support, 
10ci-test-error: CI job mediawiki-quibble-composer-mysql-php80-docker on mediawiki/core gate-and-submit is flaky failing with Segmentation fault - https://phabricator.wikimedia.org/T315167 (10hashar) I have added a few more... [10:20:15] 10Continuous-Integration-Infrastructure, 10MediaWiki-Core-Tests, 10PHP 8.0 support, 10ci-test-error: CI job mediawiki-quibble-composer-mysql-php80-docker on mediawiki/core gate-and-submit is flaky failing with Segmentation fault - https://phabricator.wikimedia.org/T315167 (10hashar) I think we should upgra... [10:22:01] so happy I wrote a blog post about debugging a segfault inside a container https://phabricator.wikimedia.org/phame/post/view/152/help_my_ci_job_fails_with_exit_status_-11/ ;) [10:22:12] turns out to have a lot of hints that are helpful [10:29:32] hashar: np, welcome back! [10:29:44] (03CR) 10Hashar: dockerfiles: Add optional tag argument to debug-image (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/823195 (owner: 10BryanDavis) [10:29:48] (03PS3) 10Hashar: dockerfiles: Add optional tag argument to debug-image [integration/config] - 10https://gerrit.wikimedia.org/r/823195 (owner: 10BryanDavis) [10:30:35] (03CR) 10Hashar: [C: 03+2] dockerfiles: Add optional tag argument to debug-image [integration/config] - 10https://gerrit.wikimedia.org/r/823195 (owner: 10BryanDavis) [10:32:36] (03Merged) 10jenkins-bot: dockerfiles: Add optional tag argument to debug-image [integration/config] - 10https://gerrit.wikimedia.org/r/823195 (owner: 10BryanDavis) [10:34:27] 10GitLab (Infrastructure), 10Data-Persistence-Backup, 10serviceops, 10serviceops-collab, and 2 others: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin1001 for host gitlab2003.wikimedia.org with OS bullseye [10:51:23] 10Beta-Cluster-Infrastructure, 10Performance-Team: Upgrade deployment-mdb01 to Buster/Bullseye - 
https://phabricator.wikimedia.org/T301637 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [11:02:46] 10Continuous-Integration-Config, 10LibUp, 10phan, 10Composer: Run phan as part of composer test, rather than in bespoke CI jobs - https://phabricator.wikimedia.org/T280990 (10Aklapper) [11:06:19] 10Phabricator-Bot-Requests: Phabricator bot for train-blockers tool - https://phabricator.wikimedia.org/T315256 (10taavi) 05Open→03Resolved a:03thcipriani works fine, thanks! [11:09:20] 10GitLab (Infrastructure), 10Data-Persistence-Backup, 10serviceops, 10serviceops-collab, and 2 others: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin1001 for host gitlab2003.wikimedia.org with OS bullseye com... [11:32:33] 10Deployments, 10Phabricator, 10Developer Productivity, 10User-MModell: Create a permalink which always redirects to the current week's train blocker task - https://phabricator.wikimedia.org/T207669 (10taavi) >>! In T207669#8143929, @dancy wrote: > @taavi How often does update.php run? Every 3 hours I thi... 
[12:23:53] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10SRE, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10hashar) [12:23:58] 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10netops, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) [12:59:21] (03CR) 10Hashar: Rewrite "running tests" section in README.md (034 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/808320 (owner: 10Hashar) [12:59:35] (03PS2) 10Hashar: Rewrite "running tests" section in README.md [tools/scap] - 10https://gerrit.wikimedia.org/r/808320 [13:02:11] (03CR) 10Hashar: "I have addressed Ahmon hints and incorporated the conflicting changes made by Jaime in https://gerrit.wikimedia.org/r/c/mediawiki/tools/sc" [tools/scap] - 10https://gerrit.wikimedia.org/r/808320 (owner: 10Hashar) [13:23:27] 10MediaWiki-Releasing, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support, 10Patch-For-Review: Drop PHP 7.2 & 7.3 support from MediaWiki master branch, once Wikimedia production is on 7.4 - https://phabricator.wikimedia.org/T261872 (10Joe) [14:33:21] hashar: user was trying the "chck experimental" functionality for the puppet repo https://gerrit.wikimedia.org/r/c/operations/puppet/+/823650 [14:33:34] however i dosn;t seem to be triggering has something changed or am i missing something [14:33:39] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/ [14:36:48] (03PS1) 10Ollie Shotton: Revert "zuul: skip selenium for Wikibase repo/rest-api" [integration/config] - 10https://gerrit.wikimedia.org/r/823633 [14:41:23] hashar: possibly the job just got stuck https://integration.wikimedia.org/zuul/#q=experimental [14:44:01] jbond: checking [14:44:17] thanks [14:44:32] hmm it is not the only job more or less stuck [14:46:03] (03CR) 10Ollie Shotton: [C: 04-1] "We have decided not 
to skip jobs (see T307090). Please abandon this patch - I can't do it (Permissions? I didn't create it?). Thanks" [integration/config] - 10https://gerrit.wikimedia.org/r/811297 (https://phabricator.wikimedia.org/T307090) (owner: 10Hashar) [14:48:41] (03Abandoned) 10Hashar: zuul: skip the other selenium job for Wikibase repo/rest-api [integration/config] - 10https://gerrit.wikimedia.org/r/811297 (https://phabricator.wikimedia.org/T307090) (owner: 10Hashar) [14:49:14] Confirming I can run the job manually through jenkins [14:49:48] it is something in Zuul I think [14:50:51] or well Jenkins [14:53:25] jbond: it finally started somehow [14:53:57] looks like jobs were dead locked / waiting for something to be released [14:53:58] :-\ [14:54:34] hashar: ack ill try and use it a bit more over the next few days and see if its a persists, thanks [14:54:54] whatever was the issue, I don't think it was specific to that job [14:55:03] ack [15:00:56] thanks h.ashar [15:04:19] (03CR) 10Hashar: [C: 03+2] Handle offline nodes properly [integration/config] - 10https://gerrit.wikimedia.org/r/823188 (https://phabricator.wikimedia.org/T315106) (owner: 10Ahmon Dancy) [15:06:39] (03Merged) 10jenkins-bot: Handle offline nodes properly [integration/config] - 10https://gerrit.wikimedia.org/r/823188 (https://phabricator.wikimedia.org/T315106) (owner: 10Ahmon Dancy) [15:07:44] Thanks hashar! Welcome back for real this time! 
[15:10:35] (03CR) 10Hashar: [C: 03+2] "I have deployed the job, the 1028 agent is still offline (it is broken for some reason) but that should be enough to confirm this fix is w" [integration/config] - 10https://gerrit.wikimedia.org/r/823188 (https://phabricator.wikimedia.org/T315106) (owner: 10Ahmon Dancy) [15:11:21] (03CR) 10Ahmon Dancy: [C: 03+2] Rewrite "running tests" section in README.md [tools/scap] - 10https://gerrit.wikimedia.org/r/808320 (owner: 10Hashar) [15:16:30] (03Merged) 10jenkins-bot: Rewrite "running tests" section in README.md [tools/scap] - 10https://gerrit.wikimedia.org/r/808320 (owner: 10Hashar) [15:25:32] (03CR) 10Ahmon Dancy: [C: 03+2] "Tested in train-dev" [tools/scap] - 10https://gerrit.wikimedia.org/r/823197 (owner: 10Jeena Huneidi) [15:30:27] (03Merged) 10jenkins-bot: scap backport: improve performance of validation [tools/scap] - 10https://gerrit.wikimedia.org/r/823197 (owner: 10Jeena Huneidi) [15:32:05] a.ndrewbogott and d.caro have composed a new theme that we can all hum as we wait for CI to do it's magic: https://bash.toolforge.org/quip/wiRAp4IB8Fs0LHO5-1mS [15:38:41] :-]]] [16:11:00] 10Continuous-Integration-Config, 10LibUp, 10phan, 10Composer: Run phan as part of composer test, rather than in bespoke CI jobs - https://phabricator.wikimedia.org/T280990 (10Jdforrester-WMF) Shall we just do this, assuming we can get LibUp fixed? 
[16:12:23] (03CR) 10Jforrester: [C: 03+2] Drop archived wikibase-vuejs-components storybook (031 comment) [integration/docroot] - 10https://gerrit.wikimedia.org/r/821250 (https://phabricator.wikimedia.org/T309872) (owner: 10Michael Große) [16:15:42] (Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [16:20:42] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [16:26:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:35:09] 10Release-Engineering-Team (Doing), 10Release Pipeline (Blubber): Blubber: runuser permissions to node_modules/.cache - https://phabricator.wikimedia.org/T255434 (10wiese) I (author of the ticket) do not have stake in this and, I guess, neither has WMDE anymore. Maybe with the issue written down and linked to... [20:30:47] !log Repooled integration-agent-docker-1028 , it was mysteriously unreachable T315372 [20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:30:50] T315372: integration-agent-docker-1028 is not reachable - https://phabricator.wikimedia.org/T315372 [20:31:01] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: integration-agent-docker-1028 is not reachable - https://phabricator.wikimedia.org/T315372 (10hashar) 05Open→03Resolved a:03hashar According to https://horizon.wikimedia.org/project/instances/886e9ca7-a2bc-4bd3-9a71-ae743245fe6a/ It had... 
[20:32:13] dancy: I have rebooted the integration-agent-docker-1028 instance which was causing maintenance-disconnect-full-disks.groovy to fail [20:32:56] Thanks! [20:33:48] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: integration-agent-docker-1028 is not reachable - https://phabricator.wikimedia.org/T315372 (10bd808) Based on timing, this was likely related to [[https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/NZCUDU5I4K6TTWXVT3... [20:36:58] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: integration-agent-docker-1028 is not reachable - https://phabricator.wikimedia.org/T315372 (10hashar) Could have been. That got noticed last week due to a script failing T315106 and I am guessing the instance had been unreachable for a lot... [20:41:22] 10Release-Engineering-Team (Seen), 10Scap, 10Epic, 10Goal: Automate the Train - https://phabricator.wikimedia.org/T196515 (10dancy) [20:41:30] 10Release-Engineering-Team (🌱 Spring Cleaning — April 2022), 10Scap: Automate rebuild l10n cache for Train - https://phabricator.wikimedia.org/T245187 (10dancy) 05Open→03Declined This is superseded by T310395 [20:42:45] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [20:43:13] 10Release-Engineering-Team (Seen), 10Scap: Running "scap pull" on deploy1001 halts for 2 minutes, then reports a php-fpm error - https://phabricator.wikimedia.org/T246959 (10dancy) @Krinkle: I'm thinking the resolution to this ticket may be "don't run scap pull on the deploy server". What do you think? [20:43:19] ^demon, dancy: please be aware that beta is down and we don't have much idea yet as to why [20:43:45] Thx RhinosF1 [20:44:43] dancy: I have no idea if it's block train problem but assuming as conductors you wanted to know [20:44:51] Understood. 
[20:44:54] It's completely down as in timing out altogether [20:47:33] dancy: on beta too, is there a good way to tell people to check T315350 if they see it down [20:47:33] T315350: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 [20:48:03] hmm.. That's a good question [20:49:00] dancy: is that a wikitech email or SAL? [20:49:22] Both! [20:50:15] dancy: which SAL [20:50:17] mutante: just gonna merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/823636 and trace out what happens, basically? [20:50:31] I'll write an email [20:50:50] RhinoF1: Let's start with this channel's SAL [20:50:56] brennen: I just wanted to compile it and stare at the full catalog again [20:51:48] !log beta: is down see wikitech-l and https://phabricator.wikimedia.org/T315350 [20:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:54:21] T315351 wouldn't have b0rked deployment-cache-text06 would it..? [20:54:21] T315351: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 [20:54:57] 10Release-Engineering-Team (Seen), 10Scap: Integrate mwdebug staging as part of `scap sync-dir` - https://phabricator.wikimedia.org/T239373 (10dancy) `scap backport` has been implements some of the suggestions in this ticket. [20:55:21] TheresNoTime: very possibly. Maybe tag traffic. 
[20:55:24] https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/WT55MAC5P6X5QUHOKHTESKXQKPUBXU4Z/ [20:55:32] I'm gonna make it a subtask [20:55:34] And raise [20:55:56] 10Beta-Cluster-Infrastructure: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10RhinosF1) [20:56:03] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10RhinosF1) [20:56:44] 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Traffic, 10Puppet: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10RhinosF1) p:05Triage→03Unbreak! Hi Traffic, this might be stopping beta coming back up (or a false alarm). Can you take... [20:57:02] TheresNoTime: go enjoy your evening [20:58:04] I am going for a bit [21:01:14] 10Scap: scap update-interwiki-cache failure results in multiple commits - https://phabricator.wikimedia.org/T230483 (10dancy) 05Open→03Declined `scap update-interwiki-cache` does not attempt to perform git commits as of https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/757062 [21:01:56] 10Scap: Automating the swat deployment workflow - https://phabricator.wikimedia.org/T226682 (10dancy) 05Open→03Declined Superseded by `scap backport` [21:05:10] 10Release-Engineering-Team (Seen), 10Scap: Compare /common in mediawiki-staging and /usr directories at the start of scap - https://phabricator.wikimedia.org/T224980 (10dancy) 05Open→03Declined [21:09:10] 10Scap: Scap required manual 'git update-server-info' on first run - https://phabricator.wikimedia.org/T196046 (10dancy) 05Open→03Declined Closing due to age. 
[21:10:39] 10Scap, 10SRE: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076 (10dancy) 05Open→03Resolved a:03dancy Closing due to age. [21:12:11] 10Scap: scap and resourceloader l10n - https://phabricator.wikimedia.org/T168790 (10dancy) 05Open→03Declined Closing due to age [21:13:04] 10Gerrit: Users with a different name in the cn field compared to uid field cannot use http auth - https://phabricator.wikimedia.org/T225308 (10Aklapper) Can this task be resolved, or declined, or is there more to do here? (Asking as tasks shouldn't remain stalled for years.) [21:13:16] Project beta-update-databases-eqiad build #60794: 15ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/60794/ [21:14:00] 10Beta-Cluster-Infrastructure, 10Scap: Rewrite wmf-beta-update-databases.py as plugin - https://phabricator.wikimedia.org/T151519 (10dancy) 05Open→03Resolved a:03dancy [21:17:33] 10Scap: Don't allow servers to randomly sync across DC - https://phabricator.wikimedia.org/T76658 (10dancy) 05Open→03Resolved a:03dancy Scap selects rsync source servers efficiently today. [21:19:40] So `beta-update-databases-eqiad` is timing out its run (when it gets to wikifunctionswiki) [21:19:47] that's... interesting [21:22:07] Is it actually trying to finish a change? [21:23:56] brennen: see on https://puppet-compiler.wmflabs.org/pcc-worker1002/36765/phab2002.codfw.wmnet/index.html how it changes File[/etc/ssh/sshd_config] at the bottom? [21:24:55] and then how that "file { $sshd_config" is (only) inside class phabricator::vcs .. but how in my previous change I told it to skip that [21:25:02] Reedy: well manually doing `mwscript update.php --wiki=wikifunctionswiki --quick` gets about half way then stops. 
Normally wikifunctionswiki takes around ~7 minutes to update (whereas most others take seconds), so the majority of that 45 minute timeout is it sat on wikifunctionswiki [21:26:24] 10Beta-Cluster-Infrastructure, 10Epic: 502 errors on beta cluster - https://phabricator.wikimedia.org/T312253 (10Jdforrester-WMF) [21:26:28] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10Jdforrester-WMF) [21:26:45] brenne: is it just because I do "Boolean $enable_vcs = undef," and if $enable_vcs { ?:p [21:27:57] * brennen looks [21:28:11] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10TheresNoTime) [21:28:42] errrr T315379 ^ is not right.... right? [21:28:43] T315379: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 [21:30:21] ori: co ordinate beta here [21:30:46] brennen: in https://puppet-compiler.wmflabs.org/pcc-worker1002/36765/phab2002.codfw.wmnet/change.phab2002.codfw.wmnet.pson the "enable_vcs" is "false" though in 2 places... but it acted as if was true [21:31:18] which included the "interface::alias" part which I don't want [21:33:11] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10dancy) https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453 (from last week) updates logspam to handle the new t... 
[21:36:14] i'm at a loss [21:36:39] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10TheresNoTime) >>! In T315379#8159573, @dancy wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453 (from... [21:38:16] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10Zabe) > Is puppet running there regularly? Yes > The last Puppet run was at Tue Aug 16 20:38:24 UTC 2022 (55 minutes... [21:39:52] brennen: thanks for confirming it's not just me.. heh.. [21:40:29] gardening is a good career, right.. trees don't just decide to stop working [21:42:08] the usual backup job was alway zookeeper.. feed the real penguins and gnus [21:47:32] beta's puppetmaster is meant to be its own client too right? [21:48:16] because that's broken.. `certificate verify failed (certificate revoked): [certificate revoked for /CN=deployment-puppetmaster04.deployment-prep.eqiad.wmflabs]` [21:49:03] brennen: I think I need to stop having the "lvs::realserver" stuff. modules/role/manifests/phabricator.pp: include ::lvs::realserver [21:49:26] provided git-ssh is the only reason to have it [21:49:47] that would add the LVS service IP on loopback.. that makes sense [21:49:55] does not explain the sshd_config part.. but something [21:51:52] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10TheresNoTime) >>! In T315379#8159618, @Zabe wrote: >> Is puppet running there regularly? > > Yes > >> The last Puppe... [21:52:23] the include is in the role itself.. 
so can't just change it based on hiera lookup..sigh [22:05:01] Project beta-update-databases-eqiad build #60795: 15ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/60795/ [22:20:54] dancy: seems like you know about /etc/helmfile-defaults/mediawiki/release/mwdebug-pinkunicorn.yaml ? [22:24:40] zabe: are you still trying to debug puppet on deployment-puppetmaster04? [22:25:14] TheresNoTime: or you? [22:25:31] ori: nope, have stopped [22:25:38] kind of [22:25:48] I try to understand why apache is not willing to start [22:25:54] AH00526: Syntax error on line 1 of /etc/apache2/conf-enabled/50-configmaster-port.conf [22:26:01] Cannot define multiple Listeners on the same IP:port [22:26:10] https://phabricator.wikimedia.org/T315379#8159650 was my last input ref. puppet [22:26:32] if anyone has branch delete rights in gerrit, i mistakenly created the `es7` branch in `mediawiki/vendor`, what i actually needed was an `es710` branch to match what exsits in the extension repositories. No big deal, but if someone has the rights to `git push origin --delete es7` in mediawiki/vendor it would help keep things clean. Nothing was ever merged to es7, it has the same commit [22:26:34] zabe: how did *that* happen o.o [22:26:34] as master branch [22:28:13] zabe: ports.conf also has a 'Listen 80' directive, and it gets included [22:28:43] mutante: yes. Unfortunately I'm not at my desk to address it. I'll work with j.oe tomorrow to clean it up [22:28:50] in the history of puppetizing apache we had a couple ways to avoid the ports.conf issues [22:30:11] dancy: all good. ack. just take it as "fyi, it failed because of permission issues with the file" [22:30:23] thanks [22:31:22] zabe: if it uses modules/httpd then the /etc/apache/ports.conf should be installed by puppet and not duplicate stuff [22:32:00] if it comes from distro.. 
then that conflict would be a common issue [22:32:30] like when it puppetizes the site but not also ports.conf [22:33:21] and that class also has an option to fall back to a default ports.conf or not. $remove_default_ports [22:33:48] there are multiple things that are broken but until we get the puppet git repo on /var/lib/git/operations/puppet deployment-puppetmaster04 in a good state we won't know if the issues we're debugging are already fixed in puppet [22:36:47] it's currently in state MERGING, with a conflict in modules/profile/manifests/etcd/v3.pp [22:38:36] the conflict is between https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701 which was never merged in prod but had been cherry-picked in beta for the past year and https://gerrit.wikimedia.org/r/c/operations/puppet/+/822118 [22:40:17] 10Phabricator, 10Release-Engineering-Team (Bonus Level 🕹️), 10serviceops, 10serviceops-collab, 10Patch-For-Review: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (10Dzahn) @thcipriani So far I am expecting to copy /srv/repos and /srv/dumps from old to new phab servers. with... [22:40:23] 10Release-Engineering-Team, 10Scap: Scap backport: Notify on irc when change has been deployed to mwdebug - https://phabricator.wikimedia.org/T314613 (10jeena) 05Open→03In progress a:03jeena [22:40:33] it does use ports.conf from modules /httpd/files/default-ports.conf, I don't think it should do so [22:41:19] sorry the conflict is https://gerrit.wikimedia.org/r/c/operations/puppet/+/820090 [22:43:57] ori: JBond is super responsive. 
If you just even CC him in a comment on Gerrit I am sure he will have something by tomorrow for the "real fix" [22:46:14] I don't think he broke anything, it's that the Puppet repo on Beta has a year-old patch cherry-picked locally, there was bound to be a conflict eventually [22:47:20] (Yea, All I meant is the part to comment if that cherry-pick makes sense) [22:52:23] ori, are you currently trying to rebase that commit? [22:53:41] zabe: the apache is a configmaster as in https://config-master.wikimedia.org/ not the puppetmaster. it just happens to be the same host in beta I guess [22:54:23] maybe it breaks because both roles are combined [22:54:32] since some commit in the past [22:54:44] which would be separate machines in prod [22:54:50] so you wouldn't notice it there [22:55:44] modules/profile/manifests/configmaster.pp vs puppetmaster [22:56:04] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10ori) The Puppet repo on `deployment-puppetmaster04:/var/lib/git/operations/puppet` is in MERGING state. There's an unre... [22:56:25] zabe, I'm not; I haven't touched the repo [22:59:07] is there a root task for 'beta cluster is broken'? [22:59:50] ori: https://phabricator.wikimedia.org/T215217 I guess [23:00:08] https://phabricator.wikimedia.org/project/view/497/ [23:00:10] T315350 is the one for today I think [23:00:11] T315350: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 [23:00:56] some progress: https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ made puppet on puppetmaster04 run again [23:01:25] zabe: nice! [23:01:58] zabe: did you fix the merge conflicts? 
I no longer see modules/profile/manifests/etcd/v3.pp as an umerged path [23:02:23] no, I just abort the git merge in order to apply test that patch [23:03:53] I'm not sure that's the best idea [23:04:37] for all we know all the configuration issues are the result of the local repo and origin having 12 and 279 different commits each [23:04:40] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [2022-08-16T23:25:49Z, #wikimedia-releng, @Zabe] I try to understand why a... [23:04:52] 10Beta-Cluster-Infrastructure: (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10TheresNoTime) [23:04:57] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [23:05:00] Project beta-update-databases-eqiad build #60796: 15ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/60796/ [23:05:07] I don't think it's ideal to make the delta even bigger [23:05:54] I don't have a problem removing that patch again, without the rebase it's useless, I only wanted to test it whether it actually fixes the puppet failure. We can let it go through gerrit. [23:06:24] 10Beta-Cluster-Infrastructure: (Beta Cluster) Unexpected connection error communicating with Elasticsearch. 
Curl code: {curl_code} - https://phabricator.wikimedia.org/T315354 (10TheresNoTime) 05Open→03Invalid Expected symptom of T315350 [23:06:29] 10Beta-Cluster-Infrastructure: (Beta cluster) Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded - https://phabricator.wikimedia.org/T315355 (10TheresNoTime) 05Open→03Invalid Expected symptom of T315350 [23:06:31] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [23:06:37] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [23:06:48] (sorry for the noise) [23:07:23] can I give it a shot? I'd like to try removing https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701 , rebasing, and then re-applying it [23:07:29] ^ zabe [23:07:45] sure [23:08:13] ok, doing [23:11:53] 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10cloud-services-team (Kanban): Two volumes not deleting/creating on deployment-prep - https://phabricator.wikimedia.org/T309659 (10TheresNoTime) 05Open→03Resolved a:03taavi Volumes deleted [23:15:41] 10Beta-Cluster-Infrastructure: Beta cluster unreachable when interacting with sessions (Error: 502, Next Hop Connection Failed) - https://phabricator.wikimedia.org/T284279 (10TheresNoTime) 05Open→03Resolved a:03TheresNoTime //This// resolved [23:15:43] 10Beta-Cluster-Infrastructure, 10Epic: 502 errors on beta cluster - https://phabricator.wikimedia.org/T312253 (10TheresNoTime) [23:20:36] > I'd like to try removing https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701 , rebasing, and then re-applying it [23:21:19] ok I did that, I went with the upstream code for the conflicting lines, so I'm not sure that 
particular patch will work as intended [23:24:17] unfortunately deployment-cache-text06 is still hitting the duplicate declaration error [23:26:04] Failed to open TCP connection to deployment-puppetmaster04.deployment-prep.eqiad.wmflabs:8140 [23:26:08] so https://gerrit.wikimedia.org/r/c/operations/puppet/+/823762/ is still needed [23:26:08] I think reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/816806 locally should help [23:26:57] 10Phabricator, 10Release-Engineering-Team (Bonus Level 🕹️), 10serviceops, 10serviceops-collab, 10Patch-For-Review: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (10Dzahn) Other things in /srv on phab1001 are: ` 871M /srv/deployment 1.6G /srv/dumps 4.0K /srv/git.wikimedia... [23:28:48] 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10SRE, 10Traffic, 10Puppet: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10TheresNoTime) Introduced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/816806 ? `lang=diff diff --git a... [23:29:30] (oh, didn't see you mentioned it ori, concur ^) [23:30:35] ok created the revert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638), cherry-picked it on beta puppetmaster, also cherry-picked zabe's change [23:31:19] there's a catch-22 tho, puppet can't fix apache and depend on it at the same time [23:31:29] zabe: did you get apache running by editing manually? [23:32:31] yes, it's a bit hacky but I did not knew a better way [23:33:10] can you do it again? [23:34:10] sure [23:34:31] puppet ran \o/ [23:34:44] 10Phabricator, 10Release-Engineering-Team (Bonus Level 🕹️), 10serviceops, 10serviceops-collab, 10Patch-For-Review: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (10Dzahn) regarding the UIDs.. user 'phd' has a reserved UID of 498. per docs (https://wikitech.wikimedia.org/wi... 
[23:34:46] new failure on deployment-cache-text06: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Package[haproxy]' in parameter 'require' (file: /etc/puppet/modules/esitest/manifests/init.pp, line: 36) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs [23:34:59] Using the shared WMCS puppet master as the puppet master for puppetmaster04 is one way to avoid those kinds of chicken/egg problems of messed up manifests on the puppet master itself in the future. [23:36:05] we used to recommend self-master setups, but time has shown them to be fragile. [23:37:44] ok, created another revert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639/) for cherry-picking locally [23:37:55] bd808: ack [23:39:05] new failure: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Could not find declared class ::esitest (file: /etc/puppet/modules/profile/manifests/cache/varnish/frontend/text.pp, line: 7, column: 5) on node [23:39:07] deployment-cache-text06.deployment-prep.eqiad.wmflabs [23:39:09] * ori cries [23:39:52] * bd808 makes consoling noises near ori [23:40:39] I guess that missing 'Package[haproxy]' is a sign that the deployment-prep CDN edge is out of date [23:41:10] I'm pretty sure that the prod CDN is using haproxy at the outside edge these days [23:41:41] `Since April 2022 (T290005) we use HAProxy for TLS and HTTP2 termination, Varnish for the in-memory cache ("frontend"), and Apache Traffic Server is responsible for on-disk persistent caching ("backend").` -- https://wikitech.wikimedia.org/wiki/Caching_overview [23:41:42] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [23:43:44] OK I hacked in locally `if $::realm != 'labs' {}` around the ::esitest in 
profile/manifests/cache/varnish/frontend/text.pp [23:43:51] I'll turn it into a patch shortly [23:44:03] puppet ran successfully on deployment-cache-text06 [23:45:21] woo [23:46:44] 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Discovery-Search, 10Wikidata, and 2 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10TheresNoTime) [2022-08-17T00:43:52Z, #wikimedia-releng, @ori] OK I hacked in locally `if... [23:47:10] https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page still errors out [23:48:00] has apache (re)started? [23:49:51] what I've managed to find: it seems like mediawiki is working locally on the MediaWiki hosts. But I get timeouts on the -cache-text server. Traffic server is complaining about varnish-frontend, but varnish isn't reporting errors afaict. [23:49:51] yeah, `curl -i --connect-to ::$HOSTNAME -H 'X-Forwarded-Proto: https' 'http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page'` works on deployment-mediawiki12 [23:52:02] 10Release-Engineering-Team (Seen), 10Scap: Running "scap pull" on deploy1001 halts for 2 minutes, then reports a php-fpm error - https://phabricator.wikimedia.org/T246959 (10Krinkle) @dancy I'd be fine with that. Essentially what scap-pull would be for, apart from accidental command runs under confusion of ope... [23:57:35] welp. Restarting trafficserver seems to have fixed it. [23:58:15] :D [23:58:48] I note there are no errors in journalctl -u trafficserver and the systemd unit was "running" [23:59:42] \o/ [23:59:44] nice [23:59:55] I was about to report: I can hit the mediawiki servers from the cache-text servers and I've tried restarting varnish-frontend and trafficserver-tls to no avail. Then I realized I hadn't tried the trafficserver service. But there are no logs that give any indication why that should have worked :\