[01:04:41] (Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org
[01:15:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:32:58] MediaWiki-Releasing, Security: Tracking bug for MediaWiki 1.35.5/1.36.3/1.37.1 - https://phabricator.wikimedia.org/T292227 (Reedy) In progress→Resolved a:Reedy
[01:33:04] MediaWiki-Releasing, Security: Release MediaWiki 1.35.5/1.36.3/1.37.1 - https://phabricator.wikimedia.org/T292226 (Reedy)
[01:43:55] I’m not sure if anyone’s still up, but I pushed a bunch of Termbox changes (late at night, hoping to avoid disturbing others), and apparently Zuul hasn’t even started running them yet
[01:44:08] no builds since Dec 20 at https://integration.wikimedia.org/ci/job/trigger-termbox-pipeline-test/
[01:44:42] if they don’t recover by themselves until tomorrow, feel free to just cancel the builds, at this stage I’m not interested in running CI for these changes
[01:44:49] I just wanted to have them on Gerrit
[01:45:22] (I still have nine more commits locally but I won’t push them for now)
[01:50:53] As https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709125 is "Merge Conflict"...
[01:52:58] Is CI doing anything at all atm?
[01:54:49] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Reedy)
[01:55:04] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Reedy) p:Triage→High
[01:55:05] Reedy: the only thing happening in Jenkins is a scap in deployment-prep...
[01:55:33] the Gearman job queue has gone rather high
[01:57:35] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Reedy)
[01:58:10] * bd808 tries to remember things he used to know about debugging zuul
[01:58:36] https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Debugging
[02:02:53] apparently I can't do much. and what I can do looks weird.
[02:02:58] 4026 zuul 20 0 339984 51908 4368 S 97.4 0.1 22736:40 zuul-merger
[02:03:38] `sudo /usr/sbin/service zuul status` on contint1001 says the unit is masked and dead...?
[02:04:21] I think Lucas has upset it by depending on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709125
[02:04:22] * bd808 wanders off to eat dinner
[02:06:15] I think it's a weird cross repo dependancy that shouldn't exist
[02:15:55] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Reedy) I think Lucas upset it with his stack of changes of nearly 20 patches... ` 2021-12-22 02:10:07,810 DEBUG zuul.Repo: Resetting repository /srv/zuul/git/mediawiki...
[02:34:47] Reedy: contint2001.wikimedia.org is apparently the active server. At least that's where zuul-server is actually running
[02:35:13] should I restart it?
[02:36:04] legoktm: It's not doing anything useful atm...
[02:36:20] legoktm: if you know how and are comfortable doing that, yes I think
[02:36:59] yeah, it's just a systemctl restart
[02:37:43] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (bd808) contint2001.wikimedia.org is apparently the active zuul host. That should maybe be added to https://www.mediawiki.org/wiki/Continuous_integration/Zuul?
[02:38:26] done
[02:38:32] any patches need to be "recheck"ed
[02:40:33] ah, there goes the one Reedy rechecked
[02:40:43] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:40:49] ohai icinga-wm
[02:40:52] yeah, it was a little slow to kick off
[02:41:16] yeah I was tailing the logs and it just kind of sat there for a minute or two
[02:41:29] "wut, where has my backlog gone?!"
[02:42:55] 2021-12-21 23:38:01,776 ERROR zuul.MutexHandler: Held mutex mwcore-codehealth-master-non-voting being released because the build that holds it is complete
[02:42:55] 2021-12-21 23:38:01,780 ERROR zuul.MutexHandler: Mutex can not be released for in postmerge> which does not hold it
[02:44:38] not sure if the times are right, but that's all there is in the error.log
[02:44:41] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org
[02:45:54] also stuff like 2021-12-22 02:36:44,171 WARNING zuul.Scheduler: Build set in test> #builds: 0 merge state: PENDING> is not current
[02:48:22] * bd808 returns to cursing at golang things
[02:48:25] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Legoktm) I looked in the error and debug logs and didn't really see anything noteworthy: `name=error.log 2021-12-21 23:38:01,776 ERROR zuul.MutexHandler: Held mutex mw...
[02:48:38] * legoktm heads to dinner, ping if anything else is needed
[03:04:41] (Queue (Jenkins jobs + Zuul functions) alert) resolved: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org
[03:25:52] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (Krinkle) I've stripped most contint1001 references on mw.org and updated Wikitech to mention this alias: https://wikitech.wikimedia.org/wiki/Contint
[03:29:38] contint1001 (Redirected from Contint)
[03:57:59] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (thcipriani) I just did a recheck and it seems to catch it on contint2001, and I see zuul processing the queue in `/var/log/zuul/debug.log` and I see things merging in `...
[03:58:10] well. I looked at my phone: I see the recheck I just did moving and services look like they're running.
[03:58:15] I'm going to check a core patch
[04:04:16] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (thcipriani) Open→Resolved a:thcipriani Ran a recheck on a core patch. This looks resolved. If zuul stalled out, [[ https://www.mediawiki.org/wiki/Continuou...
[04:04:51] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (thcipriani) a:thcipriani→Legoktm
[04:21:53] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177 (AntiCompositeNumber) I've rechecked everything that looked like it got dropped during the restart.
[06:41:17] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:42:23] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:48:26] Beta-Cluster-Infrastructure: deployment-mx02 puppet failure - https://phabricator.wikimedia.org/T294194 (Majavah) Open→Resolved Someone seems to have fixed this at some point.
[11:05:48] (PS1) QChris: Allow “Gerrit Managers” to import history [extensions/GoogleDocCreator] (refs/meta/config) - https://gerrit.wikimedia.org/r/749513
[11:05:50] (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [extensions/GoogleDocCreator] (refs/meta/config) - https://gerrit.wikimedia.org/r/749513 (owner: QChris)
[11:05:52] (PS1) QChris: Import done. Revoke import grants [extensions/GoogleDocCreator] (refs/meta/config) - https://gerrit.wikimedia.org/r/749514
[11:05:54] (CR) QChris: [V: +2 C: +2] Import done. Revoke import grants [extensions/GoogleDocCreator] (refs/meta/config) - https://gerrit.wikimedia.org/r/749514 (owner: QChris)
[11:25:38] Beta-Cluster-Infrastructure, Infrastructure-Foundations, SRE, Puppet: Setup cron for foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658 (Majavah) Open→Resolved This was done some point as `mediaw...
[12:07:34] GitLab (CI & Job Runners), Release-Engineering-Team (Radar), Security-Team, serviceops, and 2 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (Dzahn) I agree that option 1 sounds misleading and not great and option 5 sounds overly complex / brittle....
[14:33:44] Project-Admins: Create project tag for User-Raymond_Ndibe - https://phabricator.wikimedia.org/T298195 (Bugreporter)
[16:54:41] Release-Engineering-Team, MW-on-K8s: mediawiki-multiversion image builder should also poll private and security patches git repositories - https://phabricator.wikimedia.org/T298165 (dancy)
[16:55:23] Release-Engineering-Team, MW-on-K8s: mediawiki-multiversion image builder should also poll private and security patches git repositories - https://phabricator.wikimedia.org/T298165 (dancy) Dealing with private settings will require a bit more setup.
[17:16:36] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Dzahn) I made this new page that shows all fingerprints in a cen...
[17:18:32] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Dzahn) {F34892980}
[17:20:21] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Dzahn)
[17:21:56] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Dzahn) The part we haven't talked about yet is that also for the...
[18:37:03] (PS1) BryanDavis: feature: build-time arguments for lives & runs user config [blubber] - https://gerrit.wikimedia.org/r/749569 (https://phabricator.wikimedia.org/T296046)
[19:22:32] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Legoktm) f it has different sets of keys for the same hostnames...
[19:37:14] GitLab (Infrastructure), Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, Upstream: Self-reported GitLab SSH host key fingerprints don’t appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (Dzahn) Yea, well.. unless you argue "if we switch over to anothe...
[20:39:04] (CR) BryanDavis: "See Ia2ff7247c24ba3bcd43d90a900eeec8cd2e73ec2 for a diff showing how this patch changes the Dockerfile output and a practical use of the b" [blubber] - https://gerrit.wikimedia.org/r/749569 (https://phabricator.wikimedia.org/T296046) (owner: BryanDavis)
[21:59:03] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook