[00:32:02] <wikibugs>	 Release-Engineering-Team, Documentation, User-brennen: Update/organize train deployment and related policy documentation - https://phabricator.wikimedia.org/T273802 (brennen) Open→Invalid This task as-written no longer reflects reality.  At this writing:    - We have a longstanding training p...
[00:38:53] <wikibugs>	 GitLab (Upstream pit of despair 🕳️), Phabricator, Release-Engineering-Team (They Live 🕶️🧟), User-brennen: GitLab stops posting to gitlab-phabricator system hook after a failed request to the hook - https://phabricator.wikimedia.org/T329793 (brennen)
[00:53:43] <wikibugs>	 GitLab (Upstream pit of despair 🕳️), Phabricator, Release-Engineering-Team (Radar), User-brennen: GitLab stops posting to gitlab-phabricator system hook after a failed request to the hook - https://phabricator.wikimedia.org/T329793 (brennen) Open→Stalled p:Medium→Low It looks like...
[07:05:18] <wikibugs>	 Scap, Patch-For-Review: deploy-promote breaks if HTML meta tags are not self closing - https://phabricator.wikimedia.org/T336269 (CodeReviewBot) hashar merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/138  deploy-promote: make version matching regex less brittle
[07:12:28] <wikibugs>	 (CR) Hashar: [C: +2] "Jobs deployed!" [integration/config] - https://gerrit.wikimedia.org/r/918474 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[07:13:39] <wikibugs>	 (Merged) jenkins-bot: inference-services: add RevertRisk Wikidata pipelines [integration/config] - https://gerrit.wikimedia.org/r/918474 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[07:15:00] <wikibugs>	 GitLab (Integrations), Phabricator, Release-Engineering-Team (They Live 🕶️🧟), phabricator maintenance bot, User-brennen: GitLab comments should come from a CodeReviewBot instead of gerritbot - https://phabricator.wikimedia.org/T327424 (hashar) Thank you @brennen and @Ladsgroup , that will sli...
[07:17:32] <wikibugs>	 (PS1) Hashar: inferences-services: add jobs to Zuul workflow [integration/config] - https://gerrit.wikimedia.org/r/918970 (https://phabricator.wikimedia.org/T333125)
[07:17:35] <hashar>	 !log Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/918474 | T333125
[07:17:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[07:17:38] <stashbot>	 T333125: Deploy Revert-risk wikidata model to ml-staging - https://phabricator.wikimedia.org/T333125
[07:17:51] <wikibugs>	 (CR) Hashar: [C: +2] inferences-services: add jobs to Zuul workflow [integration/config] - https://gerrit.wikimedia.org/r/918970 (https://phabricator.wikimedia.org/T333125) (owner: Hashar)
[07:18:40] <wikibugs>	 (CR) Hashar: [C: +2] "The jobs are added to the Zuul pipelines with the follow up change https://gerrit.wikimedia.org/r/c/integration/config/+/918970" [integration/config] - https://gerrit.wikimedia.org/r/918474 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[07:19:15] <wikibugs>	 (Merged) jenkins-bot: inferences-services: add jobs to Zuul workflow [integration/config] - https://gerrit.wikimedia.org/r/918970 (https://phabricator.wikimedia.org/T333125) (owner: Hashar)
[07:19:32] <hashar>	 !log Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/918970 | T333125
[07:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[07:30:17] <wikibugs>	 GitLab (Project Migration), Release-Engineering-Team (They Live 🕶️🧟): Create new GitLab project group: Technical documentation - https://phabricator.wikimedia.org/T336058 (KBach) No worries, thank you!
[07:50:31] <wikibugs>	 Scap: deploy-promote breaks if HTML meta tags are not self closing - https://phabricator.wikimedia.org/T336269 (jnuche) a:jnuche
[07:50:40] <wikibugs>	 Scap: deploy-promote breaks if HTML meta tags are not self closing - https://phabricator.wikimedia.org/T336269 (jnuche) Open→Resolved
[08:35:31] <jelto>	 GitLab needs a short maintenance break in one hour
[08:47:01] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: Cookbook sre.gitlab.upgrade fails when unpausing runners - https://phabricator.wikimedia.org/T335855 (Jelto) In progress→Resolved This is solved, cookbook uses correct number of retries now. Thanks for the fix @Volans and @eoghan !
[08:55:36] <jelto>	 gitlab upgrade will happen later today due to ongoing deployment of mediawiki and hard dependency, see T336162
[08:55:37] <stashbot>	 T336162: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162
[09:00:51] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), serviceops, serviceops-collab: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (Jelto) Does scap has to fetch release tools from GitLab directly? I have some concerns of GitLab breaking the complete deployment workflow, alth...
[09:06:51] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: Add GitLab upgrades and maintenance to deployment calendar - https://phabricator.wikimedia.org/T336470 (Jelto)
[09:54:17] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: GitLab test instances fails to reconfigure/restart due to letsencrypt issues - https://phabricator.wikimedia.org/T336476 (Jelto)
[09:54:24] <wikibugs>	 (CR) AikoChou: inference-services: add RevertRisk Wikidata pipelines (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/918474 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[09:54:36] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: GitLab test instances fails to reconfigure/restart due to letsencrypt issues - https://phabricator.wikimedia.org/T336476 (Jelto)
[11:07:46] <wikibugs>	 (PS1) QChris: Allow “Gerrit Managers” to import history [services/ipoid] (refs/meta/config) - https://gerrit.wikimedia.org/r/919031
[11:07:48] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [services/ipoid] (refs/meta/config) - https://gerrit.wikimedia.org/r/919031 (owner: QChris)
[11:07:50] <wikibugs>	 (PS1) QChris: Import done. Revoke import grants [services/ipoid] (refs/meta/config) - https://gerrit.wikimedia.org/r/919032
[11:07:52] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Import done. Revoke import grants [services/ipoid] (refs/meta/config) - https://gerrit.wikimedia.org/r/919032 (owner: QChris)
[11:15:10] <wikibugs>	 GitLab (Auth & Access), Release-Engineering-Team (They Live 🕶️🧟), CAS-SSO, Infrastructure-Foundations, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) a:Jelto Refactoring of omniauth providers looks good on all instances. Changes as...
[11:16:38] <wikibugs>	 GitLab (Infrastructure), serviceops-collab, Patch-For-Review: GitLab test instances fails to reconfigure/restart due to letsencrypt issues - https://phabricator.wikimedia.org/T336476 (Jelto)
[11:21:57] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), serviceops, serviceops-collab: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (akosiaris) >>! In T336162#8843891, @Jelto wrote: > Does scap has to fetch release tools from GitLab directly? I have some concerns of GitLab bre...
[11:58:30] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), serviceops, serviceops-collab: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (jnuche) a:jnuche
[12:23:22] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), serviceops, serviceops-collab, Patch-For-Review: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/139  kubernetes: turn a faile...
[12:23:32] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), serviceops, serviceops-collab, Patch-For-Review: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (CodeReviewBot)
[12:23:34] <wikibugs>	 Release-Engineering-Team, API Platform, AQS2.0, Platform Engineering, and 5 others: Define a procedure/pattern to populate test environments - https://phabricator.wikimedia.org/T334851 (Sfaci) How fast @Htriedman! Ok, I'll start playing with this first approach and I'll let you know how it's work...
[12:34:38] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: Cookbook sre.gitlab.upgrade fails when another backup is in progress - https://phabricator.wikimedia.org/T336490 (Jelto)
[12:40:22] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: Cookbook sre.gitlab.upgrade fails when another backup is in progress - https://phabricator.wikimedia.org/T336490 (Jelto)
[12:48:57] <wikibugs>	 GitLab (Infrastructure), serviceops-collab: Investigate incremental backups for GitLab - https://phabricator.wikimedia.org/T324506 (Jelto) Open→Resolved Then let's close the task and focus on T316935.
[12:49:01] <wikibugs>	 GitLab (Infrastructure), Data-Persistence-Backup, serviceops-collab, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Jelto)
[13:29:04] <wikibugs>	 (Abandoned) Hashar: parameter_functions: Fix dependencies for StickyTOC [integration/config] - https://gerrit.wikimedia.org/r/904894 (owner: Vedmaka Wakalaka)
[13:33:21] <wikibugs>	 (Abandoned) Hashar: Archive analytics/wikistats [integration/config] - https://gerrit.wikimedia.org/r/898745 (https://phabricator.wikimedia.org/T332004) (owner: Hashar)
[14:13:08] <wikibugs>	 Phabricator, Content-Transform-Team-WIP: Remove Herald rule tagging #Product-Infrastructure-Team-Backlog-Deprecated (H228) - https://phabricator.wikimedia.org/T336151 (cscott) {T328586} looks like a similar task that @JMcLeod_WMF had been working on.  It should probably be reassigned to @msantos as well.
[14:30:51] <icinga-wm>	 PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:01] <wmf-insecte>	 Project beta-code-update-eqiad build #443280: FAILURE in 0.78 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443280/
[14:37:05] <icinga-wm>	 RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:08] <MatmaRex>	 hi, i have a weird problem. i have a tool with a sqlite database, which has grown to 700 MB. code using this database from kubernetes cron jobs works perfectly, but i can't read it from a ssh connection using the `sqlite3` executable - any query returns "Error: disk I/O error". any clues why that would happen? is it just not a good idea to have a database of this size on toolforge?
[14:41:20] <MatmaRex>	 to reproduce: sqlite3 /data/project/dtcheck/public_html/database.sqlite "select * from meta"
[14:41:35] <MatmaRex>	 (yes it's public_html and world-readable, that's ok)
[14:43:05] <bd808>	 MatmaRex: this is probably better asked in the #wikimedia-cloud channel. The fine folks of RelEng aren't responsible for Toolforge.
[14:43:23] <MatmaRex>	 oh whoops. wrong channel indeed
[14:43:25] <bd808>	 That being said, I'll poke around a bit and see if I can think of why this is breaking
[14:43:42] <MatmaRex>	 sorry. let's continue there :)
[14:45:12] <wmf-insecte>	 Project beta-code-update-eqiad build #443281: STILL FAILING in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443281/
[14:53:31] <wmf-insecte>	 Project beta-code-update-eqiad build #443282: STILL FAILING in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443282/
[14:55:42] <wmf-insecte>	 Project beta-code-update-eqiad build #443283: STILL FAILING in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443283/
[15:02:20] <wmf-insecte>	 Project mwcore-phpunit-coverage-master build #2855: FAILURE in 2 min 19 sec: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/2855/
[15:03:44] <wmf-insecte>	 Project beta-code-update-eqiad build #443284: STILL FAILING in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443284/
[15:05:55] <wmf-insecte>	 Project beta-code-update-eqiad build #443285: STILL FAILING in 2 min 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443285/
[15:06:36] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #43207: FAILURE in 2 min 11 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/43207/
[15:11:42] <wmf-insecte>	 Project beta-code-update-eqiad build #443286: ABORTED in 1 min 34 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443286/
[15:15:01] <jnuche>	 I've temporarily disabled the beta jobs
[15:15:12] <wmf-insecte>	 Project beta-code-update-eqiad build #443287: STILL FAILING in 2 min 11 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443287/
[15:15:24] <jnuche>	 CI hosts are still unable to reach gerrit
[15:30:55] <wmf-insecte>	 Yippee, build fixed!
[15:30:55] <wmf-insecte>	 Project beta-code-update-eqiad build #443288: FIXED in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443288/
[15:33:12] <wikibugs>	 GitLab (Infrastructure), Release-Engineering-Team, serviceops-collab: Add GitLab upgrades and maintenance to deployment calendar - https://phabricator.wikimedia.org/T336470 (brennen) Regular windows can be added to [[https://gitlab.wikimedia.org/repos/releng/release/-/blob/main/make-deployment-calend...
[15:35:08] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 (taavi)
[15:51:02] <wikibugs>	 Gerrit, Release-Engineering-Team, SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) This has now happened. gerrit.wikimedia.org is now on new hardware, a new IP and a new distro version.
[16:07:40] <bd808>	 mutante: I'm not able to reach gerrit.wikimedia.org [2620:0:861:2:208:80:154:151] via IPv6 on port 29418 from my home network. Any idea what I should do to try and debug?
[16:08:18] <bd808>	 If I force IPv4 with `ssh -v -4 bd808@gerrit.wikimedia.org -p 29418` things work fine
[16:09:15] <bd808>	 If I force IPv6 (or default in my stack) `ssh -v -6 bd808@gerrit.wikimedia.org -p 29418` it hangs at  "debug1: Connecting to gerrit.wikimedia.org [2620:0:861:2:208:80:154:151] port 29418"
[16:11:50] <wmf-insecte>	 Yippee, build fixed!
[16:11:51] <wmf-insecte>	 Project mediawiki-core-doxygen-docker build #43208: FIXED in 7 min 25 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/43208/
[16:16:10] <wikibugs>	 Gerrit, SRE: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (bd808)
[16:24:40] <thcipriani>	 hrm, I can confirm that gerrit is listening on all interfaces on port 29418 and iptables is set up correctly to allow connections from anywhere. So something between bd808 and this machine is amiss.
[16:25:52] <hashar>	 !log Reloading Zuul
[16:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:25:59] <bd808>	 thcipriani: do you have an off-cluster IPv6 to test from?
[16:26:12] <hashar>	 I have ipv6 here
[16:26:40] <bd808>	 hashar: context is T336524
[16:26:40] <stashbot>	 T336524: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524
[16:26:44] <hashar>	 debug1: Connecting to gerrit.wikimedia.org [2620:0:861:2:208:80:154:151] port 29418.
[16:26:44] <hashar>	 debug1: connect to address 2620:0:861:2:208:80:154:151 port 29418: No route to host
[16:26:44] <hashar>	 ssh: connect to host gerrit.wikimedia.org port 29418: No route to host
[16:26:52] <thcipriani>	 bd808: can you confirm the ipv6 is what you pasted? 2620:0:861:2:208:80:154:151 ? that...looks not correct
[16:27:06] <hashar>	 it is not announced maybe
[16:27:21] <bd808>	 `dig AAAA gerrit.wikimedia.org` returns me 2620:0:861:2:208:80:154:151
[16:27:41] <hashar>	 there was a change made for the ip v4 at https://gerrit.wikimedia.org/r/c/operations/homer/public/+/919151/1/definitions/static.net
[16:27:50] <hashar>	 and apparently that one has the ipv6
[16:27:59] <hashar>	 but that was to fix the routing between WMCS and production 
[16:28:21] <bd808>	 those are the DNS announced public IPs
[16:28:42] <taavi>	 I can reproduce the issue on one of my hosts with IPv6 connectivity
[16:29:00] <thcipriani>	 maybe I'm reading `ip a` wrong, but I see: inet6 2620:0:861:2:208:80:154:51/128 scope global deprecated
[16:29:02] <hashar>	 then if I do a traceroute (well, with mtr) over ipv6, packets are lost
[16:29:10] * bd808 is glad to not be uniquely affected
[16:29:10] <thcipriani>	 note 51 vs 151
[16:29:30] <hashar>	 and eventually after some time some packets manage to reach xe-5-3-3-500.cr1-eqiad.wikimedia.org
[16:29:48] <hashar>	 that sounds like a network announcement/routing issue of some sort :/
[16:29:52] <bd808>	 thcipriani: oh interesting. https://phabricator.wikimedia.org/T326368#8776527
[16:30:24] <taavi>	 in case this is helpful: https://phabricator.wikimedia.org/P48202
[16:30:33] <bd808>	 sounds like maybe a data entry error wherever the IPv6 is set?
[16:31:02] <thcipriani>	 possibly? (or I have no idea how to read ipv6 addresses: also a distinct possibility)
[16:31:17] <taavi>	 the ipv6 on puppet.git:hieradata/hosts/gerrit1003.yaml seems to just have a typo with the missing 1
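The 51-versus-151 mismatch spotted above can be caught mechanically by comparing the addresses DNS announces with the ones actually configured on the host. A minimal sketch, assuming both lists have already been collected by hand (for example from `dig AAAA` and `ip a`); the function name `missing_addresses` is hypothetical:

```python
import ipaddress


def missing_addresses(dns_records, configured):
    """Return DNS-announced addresses that are not configured on the host.

    dns_records: address strings as returned by a DNS lookup.
    configured:  address strings as shown by `ip a`, with or without a /prefix.
    """
    # ip_interface() accepts "addr/prefix"; .ip drops the prefix length.
    configured_ips = {ipaddress.ip_interface(c).ip for c in configured}
    return [a for a in dns_records if ipaddress.ip_address(a) not in configured_ips]


# Values quoted in the conversation above.
dns = ["2620:0:861:2:208:80:154:151"]
host = ["2620:0:861:2:208:80:154:51/128"]
print(missing_addresses(dns, host))
```

Parsing with `ipaddress` instead of comparing strings avoids false mismatches from equivalent spellings of the same IPv6 address.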
[16:32:20] <hashar>	 ah
[16:32:36] <bd808>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/909796/2/hieradata/hosts/gerrit1003.yaml
[16:33:58] <hashar>	 which means the ipv6 rules to allow port 80 / 443 / 29418 are wrong
[16:34:02] <wikibugs>	 Gerrit, SRE: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (bd808)  https://gerrit.wikimedia.org/r/c/operations/puppet/+/909796/2/hieradata/hosts/gerrit1003.yaml has a typo in the IPv6 address.
[16:34:12] <hashar>	 which also would explain why Zuul takes so long to report back to gerrit maybe
[16:34:25] <bd808>	 likely, yes
[16:34:37] <bd808>	 it would wait for the IPv6 to time out
[16:34:50] <hashar>	 let me craft the patch
[16:37:36] <hashar>	 I need an ipv6 kill switch :)
[16:38:31] <hashar>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/919161
[16:38:35] <hashar>	 thanks taavi :)
[16:40:06] <wikibugs>	 Gerrit, Release-Engineering-Team, SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (hashar)
[16:40:11] <wikibugs>	 Gerrit, SRE, Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (hashar)
[16:40:14] <taavi>	 hashar: it's not that the firewall rules are wrong, the host just does not have the service IP assigned to the correct interface at all so the packets are not routed to it
[16:44:01] <brennen>	 i should have been paying attention to this channel; i take it zuul's still backed up?
[16:44:13] <brennen>	 (as i attempt a train rollback for diagnostic purposes)
[16:45:38] <taavi>	 brennen: correct, but we believe we found the cause and are applying the fix
[16:46:10] <taavi>	 I'd still consider manually submitting the rollback patch, since iirc the ci checks in the repo are mostly meant for config changes so waiting doesn't really serve a purpose
[16:46:11] <brennen>	 taavi: thx.  rollback is for a skin regression, so nothing is on fire.
[16:46:30] <brennen>	 hmm.  i suppose that would work.
[16:46:40] <hashar>	 solved!
[16:46:51] <hashar>	 I have to get rid of the old ip though
[16:47:28] <wikibugs>	 Gerrit, SRE, Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (bd808) Open→Resolved a:hashar ` $ ssh -6 bd808@gerrit.wikimedia.org -p 29418    ****    Welcome to Gerrit Code Review    ****    Hi BryanDavis, you have su...
[16:47:30] <wikibugs>	 Gerrit, Release-Engineering-Team, SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (bd808)
[16:47:48] <taavi>	 brennen: looks like being in the priority queue was enough
[16:47:54] <brennen>	 yup
[16:49:20] <wikibugs>	 Gerrit, SRE, Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (hashar) I have removed the faulty IPv6 from /etc/network/interfaces and manually removed it with: ` ip addr del 2620:0:861:2:208:80:154:51/128 dev eno8303 `
[16:49:36] <bd808>	 possible follow up: add some IPv6 connectivity checks for gerrit.wm.o based on DNS lookups
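bd808's suggested follow-up could look something like the sketch below: resolve the name, then attempt a TCP connection to every returned address, so that a broken IPv6 path is reported even while IPv4 works. The function name `check_families` and its injectable `resolve`/`connect` parameters are assumptions made for illustration, not an existing Wikimedia check.

```python
import socket


def check_families(host, port, timeout=5.0,
                   resolve=socket.getaddrinfo,
                   connect=socket.create_connection):
    """Try a TCP connection to every address the name resolves to.

    Returns a dict mapping each resolved address to "ok" or "failed: <reason>".
    resolve/connect are injectable so the logic can be exercised offline.
    """
    results = {}
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        addr = sockaddr[0]
        try:
            conn = connect((addr, port), timeout=timeout)
            conn.close()
            results[addr] = "ok"
        except OSError as exc:  # covers timeouts and "No route to host"
            results[addr] = f"failed: {exc}"
    return results


# Example (needs network access, so left commented out):
# print(check_families("gerrit.wikimedia.org", 29418))
```

Connecting to each address individually matters here: a plain connection by hostname would silently fall back to IPv4 and hide the problem discussed above.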
[16:50:41] <hashar>	 bd808: well spotted
[16:50:57] <taavi>	 there was an icinga alert: 19:45:41 <+icinga-wm> RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[16:51:03] <hashar>	 cause I have spent like an hour trying to understand what was going on with Zuul having a very slow report back to gerrit
[16:51:24] <hashar>	 which I guess was due to X retries * Y seconds being randomly hit depending on whether it gets the IPv4 or the IPv6
[16:51:39] <bd808>	 hashar: heh. I figured it out because I was trying to push a commit to gerrit and it took way, way too long
[16:52:04] <hashar>	 `export GIT_SSH_COMMAND="ssh -4"`
[16:52:04] <hashar>	 :D
[16:52:13] <hashar>	 (that is how I have sent the patch for review)
[19:37:36] <MatmaRex>	 is it okay to sync-file some live hacks on the beta cluster for debugging? i haven
[19:37:46] <MatmaRex>	 't ever done that before, not sure what's the etiquette
[19:52:05] <MatmaRex>	 (well, i'm trying then)
[19:52:28] <thcipriani>	 MatmaRex: I'm wondering if they'll get blown away by the 10min deploy, checking now
[19:52:46] <MatmaRex>	 probably, but that's okay
[19:52:48] <MatmaRex>	 seems to have worked https://en.wikipedia.beta.wmflabs.org/wiki/Special:BlankPage
[19:52:56] <MatmaRex>	 i didn't see a log message somewhere. i expected one
[19:53:05] <MatmaRex>	 i also got a python exception from scap. interesting
[19:53:39] <thcipriani>	 paste?
[19:54:02] <MatmaRex>	 19:51:22 sync-file failed: <PermissionError> [Errno 13] Permission denied: '/srv/mediawiki-staging/scap/log/history.log'
[19:54:07] <MatmaRex>	 want the full stack trace?
[19:54:14] <thcipriani>	 ah, that's enough
[19:56:01] <thcipriani>	 (and that's interesting, group writable in prod, but not on beta, probably something to investigate)
[19:56:39] <thcipriani>	 anyway, if deploy worked, then: guess it's fine :)
[19:57:17] <MatmaRex>	 i'll play with things some more if that's cool, i'll try not to break it too badly
[19:57:23] <MatmaRex>	 should i file a task for that one?
[19:58:19] <MatmaRex>	 (i want to apply https://gerrit.wikimedia.org/r/c/mediawiki/core/+/919221 and see if it explains https://phabricator.wikimedia.org/T336504 (train blocker))
[19:58:49] <thcipriani>	 nah, that's ok, it'd end up on the bottom of the heap since this is not something that happens too often. And thank you for digging into the train blocker—looks like a gnarly one :(
[20:01:56] <wikibugs>	 Release-Engineering-Team (They Live 🕶️🧟), Release, Train Deployments: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 (thcipriani) p:Triage→Medium a:dancy
[20:23:02] <wmf-insecte>	 Project beta-code-update-eqiad build #443318: FAILURE in 2.1 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443318/
[20:24:00] <MatmaRex>	 hmm, i think i broke it :)
[20:24:27] <MatmaRex>	 i tried to checkout another commit. but i got file permissions errors, and now i can't checkout master again either
[20:24:51] <MatmaRex>	 in deployment-deploy03:/srv/mediawiki-staging/php-master
[20:25:59] <RhinosF1>	 What error do you get
[20:28:43] <MatmaRex>	 when doing what specifically?
[20:29:03] <MatmaRex>	 i get screenfuls of errors about untracked and modified files
[20:29:05] <RhinosF1>	 Checking out master
[20:29:13] <MatmaRex>	 i can copy them if you'd like
[20:29:14] <RhinosF1>	 Fun
[20:29:18] <Reedy>	 sudo
[20:29:31] <thcipriani>	 ^
[20:29:33] <MatmaRex>	 oh, am i allowed to? okay
[20:29:43] <thcipriani>	 if you're not allowed to, lemme know
[20:30:09] <thcipriani>	 ....and I think I'll pause beta-code-update for a moment :)
[20:30:10] <MatmaRex>	 that worked. thanks
[20:30:36] <MatmaRex>	 do i need to sudo all git commands?
[20:30:49] <MatmaRex>	 i could edit the files previously just fine without sudo
[20:31:04] <thcipriani>	 sudo -u jenkins-deploy would probably keep things...happier
[20:31:22] <MatmaRex>	 (i am considering trying to bisect the train blocker on the beta cluster, since it's easily reproducible there, but not locally)
[20:31:30] <Reedy>	 I think it's the git objects that end up with... odd... permissions
[20:31:36] <thcipriani>	 ^
[20:31:44] <Reedy>	 so the files themselves you can probably edit fine due to the groups and stuff
[20:31:45] <thcipriani>	 mostly owned jenkins-deploy:wikidev
[20:31:55] <MatmaRex>	 i didn't want to sudo, since i worried that then jenkins wouldn't be able to write them again
[20:32:04] <MatmaRex>	 if i accidentally created some files owned by root
[20:32:27] <Reedy>	 at worst a sudo chown -R
[20:32:31] <RhinosF1>	 ^
[20:32:38] <thcipriani>	 if you do: sudo -u jenkins-deploy -i
[20:32:39] <MatmaRex>	 heh, well, i'd do that if it was my machine
[20:33:01] <thcipriani>	 that'll give you a login shell as the jenkins-deploy user, which should be the Right™ user for modifying things in the mediawiki tree on beta
[20:33:15] <MatmaRex>	 thcipriani: thanks, that's neat. i'll try that
[20:35:15] <wmf-insecte>	 Yippee, build fixed!
[20:35:15] <wmf-insecte>	 Project beta-code-update-eqiad build #443319: FIXED in 2 min 14 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443319/
[20:38:02] <wmf-insecte>	 Project beta-scap-sync-world build #102732: FAILURE in 2 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/102732/
[20:40:03] <thcipriani>	 ok, now I'm actually going to pause ^ MatmaRex lemme know when you're done and I'll unpause/cleanup
[20:40:27] <RhinosF1>	 Why did that even fail thcipriani
[20:41:52] <thcipriani>	 !log pausing beta-code-update-eqiad/beta-mediawiki-config-update-eqiad/beta-scap-sync-world for debugging train
[20:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:42:55] <MatmaRex>	 (i haven't done anything in the last 10 minutes)
[20:43:06] <thcipriani>	 RhinosF1: I am assuming live hacking. Scap does a little local check with "mwscript eval.php" to ensure that there are no errors locally before syncing everywhere and that failed, "'mwscript eval.php --wiki aawiki' generated unexpected output: Notice: Undefined variable: wgGERestbaseUrl"
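The check thcipriani describes treats any unexpected output as a failure, not just a non-zero exit code. A rough sketch of that idea, assuming the expected output is empty; `check_quiet` is a hypothetical helper and not scap's actual implementation:

```python
import subprocess
import sys


def check_quiet(cmd):
    """Run cmd and pass only if it exits 0 and prints nothing.

    Returns (ok, output), where output is the combined stdout and stderr.
    Assumption: a healthy run is completely silent, so a PHP notice on
    either stream counts as a failure.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).strip()
    ok = proc.returncode == 0 and output == ""
    return ok, output


# Stand-in commands; a real check would run something like mwscript instead.
print(check_quiet([sys.executable, "-c", "pass"]))
print(check_quiet([sys.executable, "-c", "print('Notice: Undefined variable')"]))
```

This is why a mere notice was enough to stop the sync: the command succeeded, but it was no longer silent.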
[20:43:34] <RhinosF1>	 So a notice is a fail?
[20:43:41] <RhinosF1>	 Cool
[20:43:46] <RhinosF1>	 That confused me
[20:43:49] <thcipriani>	 yeah, No New Messages :)
[20:44:08] <RhinosF1>	 But no
[20:44:11] <RhinosF1>	 Not live hacking
[20:44:12] <RhinosF1>	 Maybe
[20:45:03] <RhinosF1>	 thcipriani: that's an actual error
[20:45:42] <RhinosF1>	 urbanecm: your patch has just cleaned some of ^ up, should there be a beta equivalent?
[20:45:48] <RhinosF1>	 Or is there and it's just waiting
[20:46:28] <urbanecm>	 that...might be me
[20:46:31] <urbanecm>	 I'll go check
[20:47:31] <thcipriani>	 FWIW, it's: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+blame/refs/heads/master/wmf-config/CommonSettings-labs.php#373
[20:49:08] <urbanecm>	 uploaded https://gerrit.wikimedia.org/r/919235 as a fix
[20:49:12] <urbanecm>	 thcipriani: RhinosF1: mind quick review?
[20:50:46] <RhinosF1>	 urbanecm: +1
[20:50:49] <urbanecm>	 ty
[20:51:13] <RhinosF1>	 urbanecm: might take a bit to get deployed on beta though because MatmaRex is live hacking
[20:51:45] <urbanecm>	 i'm backporting sth else atm anyway
[20:51:51] <MatmaRex>	 i'm happy to wait if you want to fix things
[20:52:17] <RhinosF1>	 MatmaRex: beta scap being grumpy for a bit is fine
[20:52:18] <urbanecm>	 MatmaRex: just deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/919235/ to beta to fix the unhappy scap :)
[20:52:46] <urbanecm>	 feel free to continue if the undefined complaint doesn't bother you; it'll be fixed with next scap anyway
[20:53:57] <MatmaRex>	 i'm not sure how to actually do that, so i think i'll leave it
[20:54:09] <MatmaRex>	 i was poking at things in mediawiki/core
[20:55:38] <urbanecm>	 shell.php currently complains re wgGERestbaseUrl. if that doesn't harm your testing (i think it shouldn't, but i might be wrong), feel free to finish and once you're done (and scap is re-enabled), the notice will be fixed. otherwise, we can re-enable scap now to fix it too.
[20:56:41] <RhinosF1>	 thcipriani: can help if urbanecm goes to sleep MatmaRex
[20:56:48] <RhinosF1>	 You should help
[20:56:54] <RhinosF1>	 You should be fine
[20:56:56] <RhinosF1>	 They will help
[20:56:59] <RhinosF1>	 I'm tired
[20:57:03] <MatmaRex>	 :)
[20:57:14] <RhinosF1>	 You should be fine
[20:57:19] <RhinosF1>	 That's what I was trying to say
[20:57:29] <urbanecm>	 RhinosF1: what do you mean, _if_ i go to sleep. i'll most definitely go to sleep...fairly soon :-D
[20:57:44] <RhinosF1>	 urbanecm: you should be asleep already
[21:13:03] <TheresNoTime>	 Got poked that the `beta-code-update-eqiad` job hasn't run in a while — has it been (re)paused?
[21:14:09] * TheresNoTime sees the SAL message
[21:16:10] <hashar>	 yeah
[21:16:17] <hashar>	 we paused it during the gerrit upgrade
[21:17:35] <hashar>	 ah no
[21:17:42] <hashar>	 <thcipriani> pausing beta-code-update-eqiad/beta-mediawiki-config-update-eqiad/beta-scap-sync-world for debugging train
[21:17:49] <hashar>	 at 20:41 UTC
[21:18:06] <hashar>	 so I have left https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ disabled for now
[21:19:04] <RhinosF1>	 hashar: see above
[21:19:14] <RhinosF1>	 TheresNoTime: yes, MatmaRex is debugging stuff in core
[21:19:38] <MatmaRex>	 (hi)
[21:19:56] <RhinosF1>	 MatmaRex: carry on doing your stuff
[21:20:27] <RhinosF1>	 !log beta scap is paused while MatmaRex tests some changes
[21:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:20:50] * hashar heads back to call && sleep
[21:20:51] <hashar>	 ;)
[21:29:19] <MatmaRex>	 how would i re-enable normal updates once i'm done?
[21:32:07] <thcipriani>	 MatmaRex: ping me :)
[21:32:20] <thcipriani>	 (I think this is limited to ciadmins)
[21:32:21] <MatmaRex>	 heh, okay
[21:49:20] <MatmaRex>	 (i'm making progress)
[21:51:53] <thcipriani>	 <33
[21:52:22] <MatmaRex>	 it turned out we don't know which extension broke it, so i'm bisecting the list of extensions
[21:52:30] <MatmaRex>	 reverting half of them, syncing, testing, etc
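The halving approach MatmaRex describes is a binary search over the extension list. A minimal sketch, assuming exactly one extension is responsible and that a callback can report whether reverting a given subset makes the bug go away; `bisect_culprit` and `fixed_when_reverted` are hypothetical names, and the real test step (revert, sync, reload the page) was done by hand:

```python
def bisect_culprit(items, fixed_when_reverted):
    """Find the single item whose revert fixes the bug.

    items:               candidate extensions (any sequence).
    fixed_when_reverted: callable taking a list of items; returns True if
                         reverting exactly those items makes the bug vanish.
    Assumes exactly one culprit; results are meaningless otherwise.
    """
    candidates = list(items)
    if not candidates:
        raise ValueError("no candidates to bisect")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fixed_when_reverted(half):
            candidates = half                     # culprit is in the reverted half
        else:
            candidates = candidates[len(half):]   # culprit is in the other half
    return candidates[0]


# Toy run with made-up extension names; "D" plays the culprit.
extensions = ["A", "B", "C", "D", "E"]
print(bisect_culprit(extensions, lambda reverted: "D" in reverted))
```

With N extensions this needs roughly log2(N) revert-and-test rounds instead of N.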
[22:09:42] <MatmaRex>	 thcipriani: thanks, i'm done. can you unpause the things?
[22:10:05] <MatmaRex>	 also, will that bring back all extension repos to master? i may have left some on weird commits
[22:10:57] <thcipriani>	 MatmaRex: doing
[22:11:09] <thcipriani>	 let's find out if it magically fixes thing
[22:11:11] <thcipriani>	 s
[22:11:34] <thcipriani>	 MatmaRex: do you have a repo with a debug commit I can check after a run?
[22:12:59] <MatmaRex>	 not sure what you mean
[22:13:50] <MatmaRex>	 oh, do you mean some recently merged commit that i could test to see if it's live? not really
[22:14:01] <MatmaRex>	 i can just checkout everything to origin/master
[22:14:37] <MatmaRex>	 (brb)
[22:24:48] <MatmaRex>	 thcipriani: let me know if i should do something else. otherwise i'll be off for tonight in a couple minutes. thanks for the help!
[22:32:49] <thcipriani>	 MatmaRex: I think I can figure it out, I just re-enabled everything and we'll see how it goes. Thanks for digging, have a good night <3
[22:33:01] <thcipriani>	 we'll see what explodes in 10 min
[22:33:04] <wmf-insecte>	 Project beta-code-update-eqiad build #443320: FAILURE in 3.3 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443320/
[22:33:08] <MatmaRex>	 heh
[22:33:09] <thcipriani>	 or now :)
[22:34:43] <thcipriani>	 alright, that one should be fixed
[22:40:02] <wmf-insecte>	 Yippee, build fixed!
[22:40:03] <wmf-insecte>	 Project beta-code-update-eqiad build #443321: FIXED in 4 min 9 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/443321/
[22:40:37] <thcipriani>	 \o/ ... now let's see if that was it :)
[22:44:58] <MatmaRex>	 thanks. gnight
[22:50:21] <wmf-insecte>	 Yippee, build fixed!
[22:50:22] <wmf-insecte>	 Project beta-scap-sync-world build #102733: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/102733/
[22:52:43] <brennen>	 ok bisecting the extension list is brilliant.
[22:54:52] <thcipriani>	 ^
[23:07:45] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 (brennen) End-of-North-American-workday notes: We're still at group1, pending a fix for {T336504}. Discussed with @hashar and h...
[23:14:23] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 (brennen) Per discussion with @Jdlrobson, we can probably also treat that bug as non-blocking for the train since we're close t...