[05:57:35] GitLab (CI & Job Runners), Release-Engineering-Team, mwbot-rs, mwcli: GitLab CI jobs failing with "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" - https://phabricator.wikimedia.org/T329216 (Legoktm) m...
[08:46:02] GitLab (CI & Job Runners), observability, serviceops-collab, Patch-For-Review: node exporter for gitlab-runner hosts is missing metrics for /var/lib/docker - https://phabricator.wikimedia.org/T329286 (Jelto) Open→Resolved p:Triage→Medium a:Jelto Metrics for `/var/lib/docker` a...
[08:49:27] Gerrit, Pywikibot: 500 server error when pulling Pywikibot i18n - https://phabricator.wikimedia.org/T329452 (binbot) Unfortunately git does not show which server causes the error. Is it a different repo?
[08:51:23] Phabricator, Trust-and-Safety, cloud-services-team, wikitech.wikimedia.org: Reset 2FA for Developer account 'Rosalie Perside (WMDE)' and Phabricator account @Rosalie_WMDE - https://phabricator.wikimedia.org/T329179 (Rosalie_WMDE) Thank you
[09:24:38] GitLab (CI & Job Runners), serviceops-collab: add disk space usage to grafana dashboard for gitlab-runners - https://phabricator.wikimedia.org/T327435 (Jelto) Open→Resolved a:Jelto All gitlab-runners dashboards have disk usage for `/` and `/var/lib/docker` now. Overview Trusted Runners: http...
[09:40:36] Beta-Cluster-Infrastructure, Cassandra, Beta-Cluster-reproducible, User-zeljkofilipin: Can not log in, log out, or save edits to the beta cluster (session failures) - https://phabricator.wikimedia.org/T324128 (noarave) Resolved→Open This seems to be happening again - the WikibaseLexeme se...
[09:48:36] Gerrit, Pywikibot: 500 server error when pulling Pywikibot i18n - https://phabricator.wikimedia.org/T329452 (Peachey88)
[10:01:40] GitLab (Infrastructure), serviceops-collab: Migrate gitlab-test instance to bullseye - https://phabricator.wikimedia.org/T318521 (Jelto) Open→Resolved I deleted `gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud`. I'm closing this task, migration to bullseye complete.
[10:52:24] looks like T324128 is happening again, don't fully understand the previous resolution on the task :(
[10:52:25] T324128: Can not log in, log out, or save edits to the beta cluster (session failures) - https://phabricator.wikimedia.org/T324128
[12:40:59] Phabricator, serviceops-collab: create aphlict2001 (Phabricator realtime notifications codfw) - https://phabricator.wikimedia.org/T322369 (eoghan) a:eoghan
[12:43:07] Phabricator, Trust-and-Safety, cloud-services-team, wikitech.wikimedia.org: Reset 2FA for Developer account 'Rosalie Perside (WMDE)' and Phabricator account @Rosalie_WMDE - https://phabricator.wikimedia.org/T329179 (Rosalie_WMDE) @bd808 running `ssh bastion.wmcloud.org` gives me `permission denie...
[13:12:34] Phabricator, serviceops-collab, Patch-For-Review: create aphlict2001 (Phabricator realtime notifications codfw) - https://phabricator.wikimedia.org/T322369 (eoghan) It seems that we do three things here, since the VM is already created and shut down: - [ ] Add puppet role (not many changes to make b...
[13:40:39] maintenance-disconnect-full-disks build 465148 integration-agent-docker-1039 (/: 29%, /srv: 99%, /var/lib/docker: 34%): OFFLINE due to disk space
[13:45:34] maintenance-disconnect-full-disks build 465149 integration-agent-docker-1039 (/: 29%, /srv: 80%, /var/lib/docker: 33%): RECOVERY disk space OK
[13:52:21] GitLab (CI & Job Runners), Release-Engineering-Team, mwbot-rs, mwcli: GitLab CI jobs failing with "You have reached your pull rate limit.
You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" - https://phabricator.wikimedia.org/T329216 (Addshore)...
[14:15:32] Continuous-Integration-Config, MediaWiki-Configuration: diffConfig no longer detecs any changes in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T329518 (Lucas_Werkmeister_WMDE)
[14:16:57] Continuous-Integration-Config, MediaWiki-Configuration: diffConfig no longer detecs any changes in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T329518 (Lucas_Werkmeister_WMDE) p:Triage→High Boldly making this High priority, since the config diff is quite useful when assess...
[14:24:28] Continuous-Integration-Config, MediaWiki-Configuration: diffConfig no longer detecs any changes in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T329518 (Lucas_Werkmeister_WMDE) It looks like one `buildConfigCache.php` call was removed in [multiversion: Create dblist-manage command...
[14:45:38] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (ops-monitoring-bot) Cookbook cookbooks.sre.gitlab.upgrade was started by jelto@cumin1001 on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade...
[15:12:47] GitLab: Change my 'full name' in GitLab - https://phabricator.wikimedia.org/T329057 (xcollazo) All right, [[ https://wikitech.wikimedia.org/w/index.php?title=Help%3ACreate_a_Wikimedia_developer_account&diff=2053278&oldid=2050208 | added some guidance ]] to the onboarding template around the `username` field...
[15:13:54] GitLab: Change my 'full name' in GitLab - https://phabricator.wikimedia.org/T329057 (xcollazo)
[15:14:53] Gerrit, SRE, LDAP: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792 (xcollazo)
[15:20:43] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (ops-monitoring-bot) Cookbook cookbooks.sre.gitlab.upgrade started by jelto@cumin1001 on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade Git...
[15:21:23] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (Jelto) Open→Resolved The last improvement was added to the `sre.gitlab.upgrade` cookbook. Cookbook runs display a notification for GitLab users du...
[15:23:08] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (Jelto)
[15:57:19] GitLab (CI & Job Runners), Release-Engineering-Team, Data Pipelines, Data-Engineering-Planning, serviceops-collab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (dancy) Open→Resolved a:dancy @JAllemandou This should be resolved...
[16:12:48] (PS1) Zoranzoki21: Zuul: [mediawiki/extensions/SemanticDrilldown] Archive extension [integration/config] - https://gerrit.wikimedia.org/r/888748 (https://phabricator.wikimedia.org/T327578)
[16:19:46] Release-Engineering-Team (GitLab V: Event Horizon 🌄), Scap: Scap: Don't transmit "aborted" message to IRC if no prior announcement has been made - https://phabricator.wikimedia.org/T329228 (thcipriani)
[16:19:48] Release-Engineering-Team (GitLab V: Event Horizon 🌄): Kokkuri should allow dockerfile.v0 frontend - https://phabricator.wikimedia.org/T326569 (thcipriani)
[16:19:55] Release-Engineering-Team (GitLab V: Event Horizon 🌄), Scap, Patch-For-Review: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 (thcipriani)
[16:21:03] Release-Engineering-Team (GitLab V: Event Horizon 🌄): Add Gitlab JWT support to Reggie - https://phabricator.wikimedia.org/T323394 (thcipriani)
[16:21:47] GitLab, Release-Engineering-Team (GitLab V: Event Horizon 🌄): Make a tool to convert .pipeline/config.yaml to .gitlab-ci.yaml - https://phabricator.wikimedia.org/T327332 (thcipriani) p:Triage→Medium
[16:22:15] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄): Mitigate thundering herd on GitLab runners - https://phabricator.wikimedia.org/T327416 (thcipriani)
[16:23:57] GitLab (Project Migration), Release-Engineering-Team (GitLab V: Event Horizon 🌄), Patch-For-Review, User-brennen: Migrate mediawiki/tools/release/ to GitLab - https://phabricator.wikimedia.org/T290260 (thcipriani)
[16:24:24] Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Priority Backlog 📥), serviceops-collab, Patch-For-Review: Automate integration Jenkins deployment and config changes - https://phabricator.wikimedia.org/T319406 (thcipriani)
[16:24:42] Gerrit, Release-Engineering-Team (Priority Backlog 📥), SRE-OnFire,
serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (thcipriani)
[16:25:14] GitLab (Integrations), Phabricator, Release-Engineering-Team (GitLab V: Event Horizon 🌄), User-brennen: Build a widget to display GitLab changes on related Phabricator tasks - https://phabricator.wikimedia.org/T324149 (thcipriani) p:Triage→Medium
[16:25:46] Release-Engineering-Team (GitLab V: Event Horizon 🌄): Try deploying buildkitd as a GitLab CI service - https://phabricator.wikimedia.org/T329213 (thcipriani)
[16:25:50] GitLab (Integrations), Phabricator, Release-Engineering-Team (GitLab V: Event Horizon 🌄): GitLab comments should come from a GitLabBot instead of gerritbot - https://phabricator.wikimedia.org/T327424 (thcipriani)
[16:25:58] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄), User-brennen: Add DigitalOcean resource monitoring for cloud runner nodes - https://phabricator.wikimedia.org/T308615 (thcipriani)
[16:26:04] GitLab (Project Migration), Release-Engineering-Team (GitLab V: Event Horizon 🌄), User-brennen, User-dduvall: Write a GitLab "Migrating a Project" runbook / manual based on Blubber migration - https://phabricator.wikimedia.org/T307538 (thcipriani)
[19:10:14] Project beta-update-databases-eqiad build #65130: STILL FAILING in 24 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65130/
[19:16:20] Yippee, build fixed!
[19:16:21] Project beta-code-update-eqiad build #430758: FIXED in 6 min 30 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/430758/
[19:19:40] Project beta-scap-sync-world build #90329: FAILURE in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90329/
[19:20:04] Project beta-update-databases-eqiad build #65131: STILL FAILING in 3.4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65131/
[19:25:38] Project beta-scap-sync-world build #90330: STILL FAILING in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90330/
[19:35:41] Project beta-scap-sync-world build #90331: STILL FAILING in 51 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90331/
[19:41:08] !log zabe@deployment-deploy03:~$ sudo keyholder arm
[19:46:33] Project beta-scap-sync-world build #90332: STILL FAILING in 1 min 47 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90332/
[19:55:32] maintenance-disconnect-full-disks build 465223 integration-agent-docker-1029 (/: 28%, /srv: 95%, /var/lib/docker: 54%): OFFLINE due to disk space
[19:55:50] Project beta-scap-sync-world build #90333: STILL FAILING in 1 min 2 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90333/
[19:57:09] !log zabe@deployment-deploy03:~$ sudo keyholder arm
[19:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:00:32] maintenance-disconnect-full-disks build 465224 integration-agent-docker-1029 (/: 28%, /srv: 55%, /var/lib/docker: 53%): RECOVERY disk space OK
[20:05:58] Project beta-scap-sync-world build #90334: STILL FAILING in 1 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90334/
[20:17:56] it's now failing since mariadb is dead on deployment-db10, which seems to be because the volume is not correctly mounted
[20:20:03] Project beta-update-databases-eqiad build #65132: STILL FAILING in 3 sec:
https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65132/
[20:26:22] Project beta-scap-sync-world build #90335: STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90335/
[20:37:00] Project beta-scap-sync-world build #90336: STILL FAILING in 1 min 45 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90336/
[20:46:17] Project beta-scap-sync-world build #90337: STILL FAILING in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90337/
[20:46:50] zabe: still seeing issues with deployment-db10?
[20:55:51] Project beta-scap-sync-world build #90338: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90338/
[21:05:56] Project beta-scap-sync-world build #90339: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90339/
[21:16:08] Project beta-scap-sync-world build #90340: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90340/
[21:17:46] Gerrit, serviceops-collab: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) Seems to me like this is 2 issues: a) ldapauth-gitldap.wmflabs.org is not working as an LDAP provider (outside of the Gerrit instance) b) https://gerrit.devtools.wmflabs.org/ is currently down...
[21:20:05] Project beta-update-databases-eqiad build #65133: STILL FAILING in 4.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65133/
[21:26:06] Project beta-scap-sync-world build #90341: STILL FAILING in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90341/
[21:35:59] Project beta-scap-sync-world build #90342: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90342/
[21:45:58] Project beta-scap-sync-world build #90343: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90343/
[21:50:16] Those beta scap failures are "/usr/local/bin/mwscript purgeMessageBlobStore.php" crashing with an error about there being no aawiki.revision table. Not sure if that's a db startup problem or a new misconfiguration for beta
[21:51:07] * bd808 sees zabe and taavi talking about dbs in backscroll
[21:52:10] taavi: I have a hunch that deployment-db10 is having whatever problem that was being talked about in -cloud earlier
[21:54:07] bd808: if you mean the mariadb data directory location issue, then I know that's a different issue than what deployment-db10 has at the moment.
[21:54:48] "doesn't exist in engine" is not an error that I've heard before, but it sounds very not fun.
[21:54:56] bad guessing by me then :)
[21:55:53] Project beta-scap-sync-world build #90344: STILL FAILING in 1 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90344/
[21:57:11] the internet suggests that that error is about data corruption, so I think I'm going to let someone else deal with it given how late it is here.
[22:00:52] I see aawiki/revision.* files on disk.
[22:00:53] Gerrit, serviceops-collab: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Ameisenigel) b) is probably related to T329535
[22:01:38] Gerrit, serviceops-collab: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) step 1.. VM gerrit-prod-1001 (which that host name points to) could not be reached via SSH (maybe because of the general cloud outage today) and rebooting the instance brought that back
[22:03:28] taavi, sorry, went afk after my message, it seems like the volume is back
[22:04:55] Gerrit, serviceops-collab: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) https://gerrit.devtools.wmflabs.org/ is back with the described TLS error. step 2: puppet was disabled on the instance about 17 days ago with reason "gerrit deploy". I reactivated that.
[22:05:00] zabe: "ERROR 1932 (42S02): Table 'aawiki.revision' doesn't exist in engine" is the problem now. There are data files on disk for that table, so something is keeping them from being read.
[22:05:53] Project beta-scap-sync-world build #90345: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90345/
[22:06:16] zabe: There are also warnings on startup about "Unable to load replication GTID"
[22:06:39] zabe: If you have time to poke at this I have other things that are begging for my attention
[22:08:02] i can take a quick look
[22:16:00] Project beta-scap-sync-world build #90346: STILL FAILING in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90346/
[22:20:02] Project beta-update-databases-eqiad build #65134: STILL FAILING in 2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65134/
[22:20:48] Gerrit, serviceops-collab: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) I fixed the certificate issue by: - installing package `python3-certbot-apache`.
This is the plugin to do the renewal challenge via apache httpd. In the past this was not a separate package....
[22:23:46] Beta-Cluster-Infrastructure: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577 (Zabe)
[22:25:51] Project beta-scap-sync-world build #90347: STILL FAILING in 1 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/90347/
[22:26:00] looks like there is some data corruption in deployment-db10
[22:29:35] !log Disabled beta-scap-sync-world and beta-update-databases-eqiad Jenkins jobs
[22:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:50:26] Gerrit, serviceops-collab, Patch-For-Review: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) @Ameisenigel While there are some other cleanups we should do here.. can you try what happens if you click the "Sign Up" link now that it's back? it links me to https://w...
[22:56:59] Gerrit, serviceops-collab, Patch-For-Review: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Ameisenigel) Yes, I can sign in with my regular developer account.
[23:08:09] Beta-Cluster-Infrastructure: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577 (Zabe)
[23:16:48] Gerrit, serviceops-collab, Patch-For-Review: Issues with Gerrit test instance - https://phabricator.wikimedia.org/T329444 (Dzahn) @Ameisenigel Unfortunately it's against TOU if we do that, so I had to shut it down for the moment. Can I ask what you wanted to test with the login?
[23:20:04] created backup of all databases on deployment-db09 # T329577
[23:20:04] T329577: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577
[23:20:21] !log created backup of all databases on deployment-db09 # T329577
[23:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:27:19] Beta-Cluster-Infrastructure: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577 (Zabe) I won't try recovering deployment-db10, I will just create deployment-db11 as a replacement.
[23:35:45] !log create deployment-db11 as g3.cores8.ram16.disk20 # T329577
[23:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:35:47] T329577: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577
[23:44:53] !log shutoff deployment-db10 # T329577
[23:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:44:56] T329577: deployment-db10 databases are broken - https://phabricator.wikimedia.org/T329577
[23:48:06] !log create volume db11 and attach to deployment-db11 # T329577
[23:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL