[08:35:13] (03PS1) 10Krinkle: zuul: Add words "sqlite" and "postgres" to "check php" description [integration/config] - 10https://gerrit.wikimedia.org/r/858189 [08:35:30] (03CR) 10Krinkle: [C: 03+2] zuul: Add words "sqlite" and "postgres" to "check php" description [integration/config] - 10https://gerrit.wikimedia.org/r/858189 (owner: 10Krinkle) [08:37:31] (03Merged) 10jenkins-bot: zuul: Add words "sqlite" and "postgres" to "check php" description [integration/config] - 10https://gerrit.wikimedia.org/r/858189 (owner: 10Krinkle) [08:46:13] (03PS1) 10Krinkle: zuul: Increase font size of pipeline-desc and queue-desc slightly [integration/docroot] - 10https://gerrit.wikimedia.org/r/858190 [08:47:49] (03CR) 10Krinkle: [C: 03+2] zuul: Increase font size of pipeline-desc and queue-desc slightly [integration/docroot] - 10https://gerrit.wikimedia.org/r/858190 (owner: 10Krinkle) [08:48:30] (03Merged) 10jenkins-bot: zuul: Increase font size of pipeline-desc and queue-desc slightly [integration/docroot] - 10https://gerrit.wikimedia.org/r/858190 (owner: 10Krinkle) [09:13:01] Project beta-code-update-eqiad build #418151: 04FAILURE in 0.82 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418151/ [09:24:55] Yippee, build fixed! 
[09:24:56] Project beta-code-update-eqiad build #418152: 09FIXED in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418152/ [09:27:32] 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10Patch-For-Review, 10Release, 10Train Deployments, 10User-brennen: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 (10Krinkle) [09:53:01] Project beta-code-update-eqiad build #418155: 04FAILURE in 0.81 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418155/ [09:56:27] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.006 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:57:11] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:57:19] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:19] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:05] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:01] Project beta-code-update-eqiad build #418156: 04STILL FAILING in 0.79 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418156/ [10:04:27] Project mediawiki-core-doxygen-docker build #38938: 04FAILURE in 0.56 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/38938/ [10:07:51] PROBLEM - Check systemd state on contint1001 is 
CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:01] Project beta-code-update-eqiad build #418157: 04STILL FAILING in 0.86 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418157/ [10:14:25] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5): Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 (10hashar) The update had some issues: * Due to the upgrade, all changes had to be reindexed. I have freaked out a bit cause we had 32 threads claiming to reindex all chang... [10:15:57] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5): Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 (10hashar) I have started a full offline reindexing at 9:56 UTC. [10:23:01] Project beta-code-update-eqiad build #418158: 04STILL FAILING in 0.8 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418158/ [10:33:01] Project beta-code-update-eqiad build #418159: 04STILL FAILING in 0.83 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418159/ [10:36:07] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:01] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:53] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:01] Project beta-code-update-eqiad build #418160: 04STILL FAILING in 0.77 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418160/ [10:43:55] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units 
failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:01] Project beta-code-update-eqiad build #418161: 04STILL FAILING in 0.84 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418161/ [10:57:45] 10Gerrit: Investigate changes having a wrong server id - https://phabricator.wikimedia.org/T323259 (10hashar) [10:59:55] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [11:01:17] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 58823 bytes in 0.133 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:01:35] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:53] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [11:02:19] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.039 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:02:27] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:58] Yippee, build fixed! [11:04:59] Project beta-code-update-eqiad build #418162: 09FIXED in 1 min 57 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418162/ [11:12:40] Yippee, build fixed! 
[11:12:41] Project mediawiki-core-doxygen-docker build #38939: 09FIXED in 8 min 13 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/38939/ [11:19:48] PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%): /tmp 1321 MB (2% inode=91%): /var/tmp 1321 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops [11:34:52] Project beta-code-update-eqiad build #418165: 04FAILURE in 1 min 51 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418165/ [11:37:11] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [11:37:39] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.007 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:37:55] PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [11:38:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:38:35] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:57] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:45] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:01] Project beta-code-update-eqiad build #418166: 04STILL FAILING in 0.82 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418166/ [11:44:26] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [11:45:47] RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [11:46:24] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5): Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 (10hashar) The indexing completed at roughly 11:05 UTC. That caused Gerrit to be automatically started either by Puppet or systemd. It had some more disk space issue. SR... 
[11:46:25] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53356 bytes in 0.066 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:46:29] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:51] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:01] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [11:47:29] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:47:41] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:13] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) Largest H2 files: ` -rw-r--r-- 1 gerrit2 gerrit2 8.2G Nov 17 11:47 git_file_diff.h2.db -rw-r--r-- 1 gerrit2 gerrit2 12G Nov 17 11:47 gerrit_file_diff.h2.db ` Then `g... [11:51:20] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [11:53:24] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) https://gerrit.wikimedia.org/r/Documentation/cmd-flush-caches.html [11:54:53] Yippee, build fixed! 
[11:54:53] Project beta-code-update-eqiad build #418167: 09FIXED in 1 min 51 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/418167/ [11:55:47] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [11:57:01] 10Release-Engineering-Team, 10serviceops-collab: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) p:05Triage→03High [12:01:41] RECOVERY - Disk space on gerrit1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops [12:07:05] Gerrit is back since 11:45 UTC [12:09:53] time for everyone to get mad at the UI changes /s [12:10:00] jk – thanks a lot for doing the upgrade <3 [12:11:17] (I’m happy that we’re one step closer to 3.6 which will hopefully resolve those confusing SSH errors that people keep running into) [12:17:22] 10Release-Engineering-Team, 10serviceops-collab, 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10jcrespo) If an incident is planned to be written, let me add the corresponding tag for tracking purposes only (I'm on clinic duty this week). 
[12:18:53] ooooh, there’s a “copy link to this comment” button now 😍 [12:40:33] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:56] (03CR) 10Zfilipin: WikiLambda: run e2e tests daily on betacluster (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/856646 (https://phabricator.wikimedia.org/T294388) (owner: 10Stef Dunlap) [13:28:00] 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability (Incident Followup), 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) @jcrespo I am writing the incident report at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerr... [13:35:39] Lucas_WMDE: "get mad at the UI changes" SO true ;) [13:36:06] as I understand it Google has a whole team dedicated to the UI/UX and they push changes iteratively to their internal Gerrit instances [13:36:14] with some metrics / click tracking / feedback forms [13:36:37] all those eventually land in their point release which are twice per year so each time we upgrade we have ~ 6 months worth of changes [13:37:08] I am hoping to get 3.6 next but I haven't started looking at it yet [13:37:33] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:11] 10Release-Engineering-Team, 10Scap: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 (10Urbanecm) [14:35:10] 10Release-Engineering-Team, 10Scap: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 (10Urbanecm) FTR, that API URL indeed complains about multiple changes: ` [urbanecm@deploy1002 ~]$ curl 
https://gerrit.wikimedia.org/r/changes/Ifb... [14:45:41] 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, 10Sustainability (Incident Followup), 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [14:48:22] 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, 10Sustainability (Incident Followup), 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) I have marked with #wikimedia-incident-actionable and #sre-onfire based on the incident re... [14:48:32] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [14:52:14] 10Release-Engineering-Team, 10Scap: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 (10Urbanecm) >>! In T323277#8402303, @Urbanecm wrote: > It does so for other changes as well, and `scap backport 858308 858309 858310 858311 858312... [14:55:50] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5): Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 (10hashar) 05Open→03Resolved We are on Gerrit 3.5.4 there are a few UI changes which would probably trigger some discussions here and there but overall everything else i... [15:04:29] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5): Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 (10hashar) [15:04:48] 10Release-Engineering-Team (Priority Backlog 📥), 10Gerrit (Gerrit 3.5), 10Patch-For-Review: Update Gerrit CI result table CSS style - https://phabricator.wikimedia.org/T315445 (10hashar) 05Open→03Resolved That has been addressed and the Gerrit 3.5 obsolete `@apply` statement has been removed. 
[15:08:08] 10GitLab (CI & Job Runners), 10Release-Engineering-Team, 10serviceops-collab: Migrate GitLab Shared Runners from profile::gitlab::runner to role::gitlab_runner - https://phabricator.wikimedia.org/T322409 (10Jelto) 05Open→03Resolved `runner-1030` was running with the new role for one day as a canary and a... [15:11:14] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [15:47:51] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [16:11:13] I like the new Gerrit feature where hovering symbols highlights matches! [16:32:42] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [16:32:53] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [16:37:04] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [16:44:04] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [16:45:41] 10Deployments, 10MediaWiki-Configuration, 10Sustainability (Incident Followup): ConfigException: GlobalVarConfig::get: undefined option: 'VectorArticleTools' - https://phabricator.wikimedia.org/T322372 (10Krinkle) [16:46:32] 10Deployments, 10MediaWiki-Configuration, 10Sustainability (Incident Followup): Investigate seemingly-impossible "ConfigException: undefined option" in minutes after deployment - 
https://phabricator.wikimedia.org/T322372 (10Krinkle) [16:49:56] 10Release-Engineering-Team (Seen), 10dev-images, 10serviceops: Sync node versions between docker dev and slim images - https://phabricator.wikimedia.org/T265554 (10jijiki) 05Open→03Resolved a:03jijiki Bluntly closing this, please reopen if needed [16:54:18] 10Gerrit, 10Release-Engineering-Team, 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Jelto) > SRE: > > [ ] Bring primary and replica in sync configuration-wise (SRE) > [ ] summarize disk stuff (Partman recipe etc.... [16:57:30] 10Gerrit, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10thcipriani) (meta note: tagging with the weird-for-this-task tag: #gitlab-boomerang because that's our curr... [16:59:11] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Clement_Goubert) 05In progress→03Resolved [16:59:19] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [16:59:31] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Clement_Goubert) 05In progress→03Resolved [16:59:41] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:00:01] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:00:13] 10Release-Engineering-Team 
(Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (10Clement_Goubert) 05In progress→03Resolved [17:00:23] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Clement_Goubert) 05In progress→03Resolved [17:00:34] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:00:52] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) [17:01:41] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [17:01:51] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:12:05] 10Diffusion, 10Release-Engineering-Team, 10serviceops-collab: reconsider https://git.wikimedia.org link - https://phabricator.wikimedia.org/T323073 (10Bugreporter) [17:35:20] (03PS1) 10Samtar: layout.yaml: Add composer-test-package for mediawiki/libs/IPAValidator [integration/config] - 10https://gerrit.wikimedia.org/r/858391 (https://phabricator.wikimedia.org/T322744) [17:41:31] (03CR) 10Majavah: [C: 03+2] "deploying" [integration/config] - 10https://gerrit.wikimedia.org/r/858391 (https://phabricator.wikimedia.org/T322744) (owner: 10Samtar) [17:43:21] (03Merged) 10jenkins-bot: layout.yaml: Add composer-test-package for mediawiki/libs/IPAValidator [integration/config] - 10https://gerrit.wikimedia.org/r/858391 
(https://phabricator.wikimedia.org/T322744) (owner: 10Samtar) [17:44:08] !log reloading zuul to deploy https://gerrit.wikimedia.org/r/858391 [17:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:56:56] (03PS5) 10Jforrester: Zuul: [mediawiki/extensions/WikiLambda] Disable selenium tests [integration/config] - 10https://gerrit.wikimedia.org/r/856646 (https://phabricator.wikimedia.org/T294388) (owner: 10Stef Dunlap) [17:56:58] (03PS1) 10Jforrester: jjb: Define selenium-daily-betawikifunctions-WikiLambda [integration/config] - 10https://gerrit.wikimedia.org/r/858394 (https://phabricator.wikimedia.org/T294388) [18:14:08] 10Continuous-Integration-Config, 10IPA-Validator, 10Community-Tech (CommTech-Sprint-36): Configure Jenkins CI jobs - https://phabricator.wikimedia.org/T322744 (10TheresNoTime) 05Open→03Resolved a:03TheresNoTime [18:14:20] 10Continuous-Integration-Config, 10IPA-Validator, 10Community-Tech (CommTech-Sprint-36): Configure Jenkins CI jobs - https://phabricator.wikimedia.org/T322744 (10TheresNoTime) [18:27:21] 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10JMeybohm) helmfile_log_sal has support for that already: ` # Allow to explicitely suppress logging to SAL SUPPRESS_SAL=${SUPPRES... [18:36:11] thcipriani: mutante and i have been discussing phab migration to phab1004, and are leaning towards a window on this monday the 21st. any thoughts on that? [18:46:56] 10Beta-Cluster-Infrastructure, 10Wikimedia-Site-requests, 10Logos: Add Vector-2022-style logos for the Beta Cluster - https://phabricator.wikimedia.org/T323306 (10Zabe) [18:47:18] 10Gerrit, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10SRE-OnFire, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Dzahn) >>! 
In T323262#8403129, @Jelto wrote: > @Dzahn what are your thoughts on reimaging `gerrit2002` with... [18:52:31] brennen: mondays seem like good days for migrations. And a week with a holiday at the end may mean a lighter traffic time. How much downtime are you predicting for a switchover? [18:53:43] hopefully just a few minutes; probably best to schedule a 30 minute window for some margin. [18:53:52] +1 :) [18:54:25] we _think_ this will either be a situation where it pretty much works right away or we'll switch back to phab1001 and regroup. [18:54:26] like to get it done before it's December and also avoid "the day before Thanksgiving" [18:54:38] :D [18:54:40] yea that, either it works or we revert [18:55:09] yeah, then monday sounds reasonable to me. [18:55:15] did all the database stuff get sorted? [18:55:15] today/tomorrow I am starting to switch little things like the stats email and the dumps creation [18:55:33] so that the actual switch does not include that to distract us [18:55:39] we know how to configure port correctly. we don't know how to run a readonly phab, but that seems like a question we can punt on. [18:56:38] cool [18:56:43] thcipriani: actually it had to be fixed in like 4 places. grants, puppet, scap, puppet again.. but yea, we now have "if active_server then use m3-master and 3306, otherwise use m3-slave and 3323" [18:56:59] heh, neat [18:57:07] the only disappointing part was.. Phab phd service does NOT like to run when the DB is readonly [18:57:51] but the dump and stats scripts.. they should work right now from the new machine [18:57:58] will verify that in a moment [18:59:27] so sounds like everything is over on phab1004 (if you know what's working and what fails) is the switch over a big red button (in the form of a puppet patch)? [19:00:05] yes, it is. 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/858397 [19:00:21] and we can compile it [19:02:34] the thing that will, if i had to guess, bite us is that we'll probably have missed _something_ in doing the deploy with scap. [19:02:59] i'll spend some time trying to make sure that isn't the case before monday. [19:04:09] sounds good: send a note to wikitech-l@ + ops@ with the timing [19:04:38] if the readonly DB part worked we could test it on the new host before, but this way we kind of just have to switch [19:05:25] I am making a new ticket just about "make phab work readonly" [19:05:45] but we should do the switch regardless [19:08:12] mutante: i think late afternoon is best traffic-wise, i'll defer to you on exact timing [19:15:36] brennen: 2pm PST? Europe will be out but still have hours to revert :p [19:16:39] you are in the middle of deploy, ttyl [19:18:02] 2pm PST works for me. :) [19:18:31] ack [19:22:57] 10Phabricator, 10serviceops-collab: make Phabricator work with readonly DB - https://phabricator.wikimedia.org/T323312 (10Dzahn) [19:23:09] so.. wikitech and ops and slack? I can do it [19:24:58] ohhh gerrit's been updated! [19:27:38] paladox: indeed. it's 3.5 now [19:27:51] there were things waiting for 3.5 [19:27:53] nice! [19:28:12] (03CR) 10Jforrester: [C: 03+2] "Deployed." [integration/config] - 10https://gerrit.wikimedia.org/r/858394 (https://phabricator.wikimedia.org/T294388) (owner: 10Jforrester) [19:30:09] (03CR) 10Jforrester: "PS5: Split. Let's land this?" 
[integration/config] - 10https://gerrit.wikimedia.org/r/856646 (https://phabricator.wikimedia.org/T294388) (owner: 10Stef Dunlap) [19:30:22] (03Merged) 10jenkins-bot: jjb: Define selenium-daily-betawikifunctions-WikiLambda [integration/config] - 10https://gerrit.wikimedia.org/r/858394 (https://phabricator.wikimedia.org/T294388) (owner: 10Jforrester) [19:37:09] mutante: phab has a RO flag [19:37:23] But phab's RO handling can't be called handling [19:37:46] RhinosF1: I just made https://phabricator.wikimedia.org/T323312 and linked to it .. I think? [19:37:57] if you have any experience with it, comments there are appreciated [19:38:16] that second part especially, heh [19:38:46] mutante: my experience is cry [19:39:11] ouch :/ [19:39:39] mutante: I can sum up all the issues [19:39:47] But phab and RO are fun [19:40:27] please do:) [19:40:32] thank you! [19:40:45] 10Phabricator, 10Release-Engineering-Team (Priority Backlog 📥), 10serviceops-collab, 10Patch-For-Review: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 (10Dzahn) We have also been hoping that we get a readonly Phabricator out of this in the inactive DC but this is n... [19:47:35] (03CR) 10Jforrester: [C: 03+2] build: Updating mediawiki/mediawiki-codesniffer to 40.0.1 [tools/code-utils] - 10https://gerrit.wikimedia.org/r/857953 (owner: 10Libraryupgrader) [19:48:34] migration plan added in Google calendar invite description :p [19:48:41] adds it to ticket though [19:49:29] thanks for handling that, mutante [19:50:17] yw, thanks as well for the help with DB connection [19:52:24] 10Phabricator, 10Release-Engineering-Team (Priority Backlog 📥), 10serviceops-collab, 10Patch-For-Review: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 (10Dzahn) [19:54:24] yea, and it's not just common/hieradata.yaml it's also DNS of course [19:54:46] switching phabricator.discovery.wmnet. 
making patch for that [19:55:00] so there is a moment after switching DB and before switching DNS [19:55:12] where we could test it by going to new host via bastion [19:55:52] ! [remote rejected] HEAD -> refs/for/production%topic=wikiba.se (internal error) --> is gerrit happy? [19:57:45] tries to upload a new patch [19:58:10] vgutierrez: worked for me in DNS repo [20:02:27] 10Beta-Cluster-Infrastructure, 10Community-Tech, 10Data-Persistence (work done), 10MediaWiki-extensions-Phonos: Failed to create storage directory on Beta Cluster - https://phabricator.wikimedia.org/T317195 (10MusikAnimal) [20:02:35] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10SRE-swift-storage, 10Community-Tech (CommTech-Sprint-36), 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MusikAnimal) 05Stalled→03Resolved There's nothing to... [20:03:21] vgutierrez: looks happy but I see exactly one "internal server error" at the time you asked [20:04:10] ^ +1 I see: java.util.concurrent.ExecutionException: com.google.gerrit.exceptions.StorageException: Can't insert change/patch set for operations/puppet [20:04:13] https://gerrit.wikimedia.org/r/monitoring - ERROR [com.google.gerrit.sshd.BaseCommand] Internal server error (user ki account 166) during git-upload-pack 'mediawiki/core' [20:05:03] concurrent storage exception: I'm hoping that means some strange fault in the stars where the repo was locked for a second and there's a bad error message. [20:19:47] 10Phabricator, 10Release-Engineering-Team (Bonus Level 🕹️), 10serviceops-collab, 10Patch-For-Review: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250 (10Dzahn) [20:20:36] there is always more stuff like ..SPF records for phabricator email.. 
and since they are only v6 IPs you don't see them when grepping for the host name: [20:20:39] https://gerrit.wikimedia.org/r/858412 [20:20:56] replaces phab2001 with phab2002 in SPF records [20:31:17] 10GitLab (Auth & Access): Set new owner in wmit-wikimedia GitLab group - https://phabricator.wikimedia.org/T323196 (10DAIiDAOXING) [20:32:24] 10GitLab (Administration, Settings & Policy), 10serviceops-collab: Configure a default cleanup policy for GitLab package registry - https://phabricator.wikimedia.org/T315877 (10DAIiDAOXING) [20:46:43] so hmm `[A-Z]` is wrong! One wants `\p{Lu}` [20:50:46] one wants `[[:upper:]]` [20:51:29] meta/external-idsu=)$ git grep gerrit:.*[[:upper:]] [20:51:29] 0e/9f1ce0b97ba6f98633efb889ade8e90704d797:[externalId "gerrit:Ál"] [20:51:29] 48/546c9c45f1c4dc72928178958a74d5a4747953:[externalId "gerrit:Истенный"] [20:51:29] ba/8c6fcd6e0dd090cce1c589d303a1943810d3aa:[externalId "gerrit:ԱշոտՏՆՂ"] [20:51:29] thcipriani: you missed some users :-] [20:51:58] huh [20:53:49] I wonder what tr would do to those? [21:04:21] $ echo ŽELJKO |tr '[:upper:]' '[:lower:]' [21:04:21] Željko [21:04:25] fails :) [21:08:21] 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10Scap: Ensure efficient Gitlab CI operations for scap - https://phabricator.wikimedia.org/T323140 (10dduvall) [21:08:34] 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃): Try DigitalOcean object storage for buildkit caching - https://phabricator.wikimedia.org/T323147 (10dduvall) 05Open→03Declined The buildkitd deployment running in cloud-runner now has access to the DO Spaces credentials.... 
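Note on the uppercase-matching exchange (20:46 to 21:04): the point raised there is that `[A-Z]` only covers ASCII capitals, so usernames such as "Ál", "Истенный" and "ԱշոտՏՆՂ" slip through. A minimal sketch of the difference follows, in Python rather than the grep/tr used in the channel. The helper names `has_ascii_upper` and `has_unicode_upper` are hypothetical and not part of any script discussed here. Python's built-in `re` module does not support `\p{Lu}`, so the sketch checks the Unicode general category through `unicodedata` instead.

```python
import re
import unicodedata

# ASCII-only pattern, equivalent to the `[A-Z]` called out as wrong in the chat.
ASCII_UPPER = re.compile(r"[A-Z]")


def has_ascii_upper(name: str) -> bool:
    """True if the name contains an ASCII capital letter (A to Z only)."""
    return ASCII_UPPER.search(name) is not None


def has_unicode_upper(name: str) -> bool:
    """True if any character has Unicode general category Lu (uppercase letter)."""
    return any(unicodedata.category(ch) == "Lu" for ch in name)


# The three usernames from the git grep output above, plus a lowercase control
# value added for illustration. "\u00c1" is the precomposed letter Á.
names = ["\u00c1l", "Истенный", "ԱշոտՏՆՂ", "krinkle"]

# Names an ASCII-only check would wrongly treat as having no uppercase letters.
missed = [n for n in names if has_unicode_upper(n) and not has_ascii_upper(n)]
print(missed)

# Unicode-aware lowercasing, in contrast to the byte-oriented tr run above.
print("ŽELJKO".lower())
```

The category check is used instead of `str.isupper()` because `isupper()` answers a question about the whole string, while the problem in the chat is whether any single uppercase letter is present.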
[21:12:42] 10GitLab (Administration, Settings & Policy), 10serviceops-collab: Configure a default cleanup policy for GitLab package registry - https://phabricator.wikimedia.org/T315877 (10bd808) [21:12:58] 10GitLab (Auth & Access): Set new owner in wmit-wikimedia GitLab group - https://phabricator.wikimedia.org/T323196 (10bd808) [21:49:30] 10Release-Engineering-Team (Priority Backlog 📥), 10Epic, 10Release Pipeline (Blubber): Deprecate Blubber's CLI and microservice (blubberoid) interfaces - https://phabricator.wikimedia.org/T318289 (10bd808) [21:52:35] 10Scap: Add support for "gerrit/r/:changenum" URLs to scap-backport command - https://phabricator.wikimedia.org/T323320 (10Krinkle) [21:52:45] 10Scap, 10Developer Productivity: Add support for "gerrit/r/:changenum" URLs to scap-backport command - https://phabricator.wikimedia.org/T323320 (10Krinkle) p:05Triage→03Low [21:55:48] 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10Scap, 10Developer Productivity: Add support for "gerrit/r/:changenum" URLs to scap-backport command - https://phabricator.wikimedia.org/T323320 (10dancy) [21:59:28] 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), 10Scap: Ensure efficient Gitlab CI operations for scap - https://phabricator.wikimedia.org/T323140 (10dancy) [21:59:34] 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃): Try local directory export + GitLab cache for buildkit caching - https://phabricator.wikimedia.org/T323150 (10dancy) 05Open→03Declined [22:05:18] 10Release-Engineering-Team (Priority Backlog 📥), 10Epic, 10Release Pipeline (Blubber): Deprecate Blubber's CLI and microservice (blubberoid) interfaces - https://phabricator.wikimedia.org/T318289 (10bd808) > [] Refactor Blubber internally to construct its build graph using BuildKit LLB and remove Dockerfile... 
[22:16:40] 10Scap, 10Developer Productivity: Consider reducing output in pre-sync phase of scap-backport - https://phabricator.wikimedia.org/T323325 (10Krinkle) [22:19:50] 10Scap, 10Developer Productivity: Consider reducing output in pre-sync phase of scap-backport - https://phabricator.wikimedia.org/T323325 (10Krinkle) [22:19:55] 10Scap, 10Developer Productivity: Consider reducing output in pre-sync phase of scap-backport - https://phabricator.wikimedia.org/T323325 (10Krinkle)