[02:22:03] FIRING: PuppetFailure: Puppet has failed on thanos-be2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:22:03] FIRING: PuppetFailure: Puppet has failed on thanos-be2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:26:26] that's a failed disk; I'll get to it in a bit
[08:33:25] FIRING: SystemdUnitFailed: export_smart_data_dump.service on ms-be1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:34] sigh, obviously it's hardware day :-/
[09:08:18] I am setting up an offline Wikipedia "mirror". I did it with the file wikipedia_en_all_nopic_2025-07.zim, which is a text-only copy, and it works fine. Now I am looking for the entire data, including pictures: is there one file that should be downloaded, or many files?
[09:28:25] RESOLVED: SystemdUnitFailed: export_smart_data_dump.service on ms-be1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:48] RESOLVED: PuppetFailure: Puppet has failed on thanos-be2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:53:48] FIRING: PuppetFailure: Puppet has failed on ms-be1074:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:55:00] Hmph, I just opened T409040 about the dead disk on that system
[09:55:00] T409040: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040
[09:58:48] RESOLVED: PuppetFailure: Puppet has failed on ms-be1074:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:27:57] federico3: Can you review this? https://gerrit.wikimedia.org/r/c/operations/dns/+/1201016 a little bit urgent
[10:43:32] yu
[10:43:45] thanks
[14:35:56] federico3: there are dbctl changes pending to be merged - icinga alerting
[14:36:09] they are your es* codfw hosts
[14:39:45] it's the decommissioning script currently running
[14:39:55] let me check
[14:40:01] but the dbctl changes are still pending to be merged
[14:40:13] so I believe you merged the puppet patch and didn't run a dbctl config commit
[14:40:17] then probably a race condition in the puppet merge
[14:40:34] you have to merge them manually in dbctl
[14:40:37] no script does that for you
[14:40:41] https://www.irccloud.com/pastebin/3b0GNYhq/
[14:40:51] no, in cumin1003
[14:40:54] dbctl config commit
[14:41:15] When you merge this kind of patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199742 you have to commit in dbctl
[14:41:34] ah triggered by the merge into puppet, ok merging it now
[14:41:41] yep
[14:41:42] thanks
[14:49:37] one last question, for Emperor or anyone else, assuming you run a transfer to 2 hosts, one fails with code -1 (firewall could not be opened) and another with code 2 (transfer failed in the middle), what should the exit code of the invocation be?
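One possible answer to the exit-code question above, as a minimal sketch: give each failure class its own bit and combine the per-host results with bitwise OR, so the single command-line value is non-zero whenever any host failed and still shows which kinds of failure occurred. The flag names, the mapping from the per-host codes -1 and 2, and the function name are all hypothetical; none of them come from the transfer tool itself.

```python
# Hypothetical bit flags, one per failure class (each a distinct power of two).
FLAG_FIREWALL = 1   # firewall could not be opened
FLAG_TRANSFER = 2   # transfer failed midway
FLAG_OTHER = 4      # any per-host code not listed below

# Assumed mapping from the per-host codes mentioned in the chat to flags.
CODE_TO_FLAG = {
    0: 0,
    -1: FLAG_FIREWALL,
    2: FLAG_TRANSFER,
}


def combined_exit_code(per_host_codes):
    """Fold a list of per-host result codes into one process exit code."""
    result = 0
    for code in per_host_codes:
        # OR keeps every failure class that occurred on any host.
        result |= CODE_TO_FLAG.get(code, FLAG_OTHER)
    return result
```

With this mapping, `combined_exit_code([-1, 2])` evaluates to 3 (both bits set) and an all-success run stays at 0. The individual per-host codes would still need to be logged, since the combined value does not say which host failed.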
[14:52:44] I think I don't feel strongly about the error code (presumably you could be flash and make them all powers of two and OR them together) as long as both errors are clearly reported to the operator, if you see what I mean
[14:53:01] it logs, yea
[14:53:25] and from code it returns the individual error codes in an array
[14:53:33] but on command line I only have 1 value
[14:54:06] I think you don't use the command line, so I will figure something
[14:54:12] as long as it is != 0
[14:54:17] yeah
[15:24:18] I have so many doubts about when I should be bold and do something imperfect but improvable in the future vs when I should wait and not be impatient
[15:40:52] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1201089 please? These 3 nodes are 0-weight in the rings, so now need removing for disk controller swap.
[15:43:02] checking
[15:44:29] done
[15:45:52] TY
[16:20:09] Emperor: I tried to reimage ms-be1088 two times (with Trixie) and I didn't get the grub issue.. Anything special that needs to be done, or is it a matter of trying more times?
[16:26:55] elukey: no, it seems entirely random (and perhaps affects some nodes more than others). Sorry, I realise that is a very unsatisfactory report.
[16:28:24] it bit me particularly with ms-be1083 and sretest2010 IIRC
[16:34:33] nono all good, I just wanted to know what to look for, and if there are differences
[17:26:56] <_joe_> hey
[17:27:04] <_joe_> there is an issue with a package you exported
[17:27:09] <_joe_> 2025-11-03T12:34:01.671292+00:00 apt-staging2001 gitlab-package-puller[2242694]: Error: trying to put version '0.12.1~wmf6' of 'python3-wmfmariadbpy-remote' in 'trixie-wikimedia|main|amd64',
[17:27:09] <_joe_> 2025-11-03T12:34:01.671409+00:00 apt-staging2001 gitlab-package-puller[2242694]: while there already is the stricly newer '0.12.1~wmf6+deb13u1' in there.
[17:30:30] <_joe_> can someone please fix this? It's blocking importing other packages
[17:30:48] package?
[17:30:59] I didn't release that
[17:31:00] <_joe_> else I'll go and remove the artifacts by hand myself
[17:31:22] <_joe_> jynus: I assume it's built automatically by gitlab and uploaded to apt-staging
[17:31:27] I think it's https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/
[17:31:28] <_joe_> if you merge a change to debian/changelog
[17:32:07] weird, I haven't set up package building
[17:32:14] maybe moritz did it?
[17:32:36] oh, sorry
[17:32:41] my apologies
[17:32:44] wrong package
[17:32:56] it is wmfmariadbpy, which I have nothing to do with
[17:33:04] I confused it with wmfbackups
[17:33:23] federico3 that one's for you ^
[17:33:51] apologies _joe_ for the confusion, because I updated my packages recently
[17:34:14] thanks Emperor for the pointer
[17:34:43] I think the version handling in https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/blob/trixie2/.gitlab-ci.yml?ref_type=heads may not be correct
[17:35:41] is this a recent change?
[17:35:54] AFAICT when you build version X you then backport it as versions X+wmfYu1 for each backport distro
[17:36:22] but that's wrong, because as the error message says 0.12.1~wmf6+deb13u1 > 0.12.1~wmf6
[17:36:42] also 0.12.1~wmf6 < 0.12.1 which I suspect is also not what you meant
[17:37:29] federico3 is currently in the incident review meeting, so probably won't read this today now
[17:37:30] indeed it's mine
[17:37:33] looking
[17:38:19] I wonder if uploading should require manual intervention?
[17:38:44] generally the point of apt-staging is that it gets automatically-built packages from trusted runners
[17:39:22] yes, I mean a human saying "yes, do it", while automating all other steps
[17:40:01] I guess no need if one has to do a release anyway
[17:40:01] That would defeat the point of at least some use cases.
[17:40:39] anyway, let me know if I can help, I am not very familiar with the new CI setup yet
[17:41:07] _joe_: where's that log coming from?
[17:41:43] could it be related to https://phabricator.wikimedia.org/T408109 perhaps?
[17:43:34] federico3: I think the problem is your version numbering scheme is incorrect. I think you don't want ~wmfx (since that's less than 0.12.1) for your versions after 0.12.1, and then when backporting, you want to use a strictly-lower version rather than the strictly-higher versions that you're currently using
[17:45:28] federico3: also, it's not coherent to be building against trixie-wikimedia and then _also_ trying to backport to trixie.
[17:45:42] your main d/changelog is set to trixie, and you then try and backport to trixie too
[17:49:25] if I may, maybe the right way is to just unclog the queue in some way, and then a more long term fix tomorrow?
[17:54:04] <_joe_> I tried to unclog the queue and it didn't work, but I think it can be done tomorrow morning
[17:54:18] :-(
[17:59:58] FWIW I can try disabling the backport and bump up the version to 0.12.1~wmf7
[18:03:42] OOI, why not 0.12.2? this is our software.
[18:05:05] (but yes, I would expect bumping the version and stopping trying to both build-for-trixie and backport-to-trixie would work)
[18:06:30] python3-wmfmariadbpy_0.12.1~wmf7_amd64.deb got built
[18:06:44] Emperor: just because there are no source changes :)
[18:08:24] Integers are cheap, and 0.12.2 would have been greater than 0.12.1 in a way that 0.12.1~wmfX wasn't.
[18:11:49] I can bump to 0.12.2-wmf1 with dash
[18:12:34] bumping 2 for clarity and switching to dash for sorting after .2
[18:15:03] on https://wikitech.wikimedia.org/wiki/Debian_packaging/Package_your_software_as_deb but the doc requires ~ for backports, uff
[18:18:00] https://www.irccloud.com/pastebin/Fe30pY6P/
[18:18:15] well to unblock _joe_ this should be enough at the moment
[18:24:36] federico3: yeah, so you want 0.12.2-wmf1 (say) for the main branch, and backports like 0.12.2-wmf1~deb11u1
[18:24:54] because backport versions should be smaller than the thing they are backported from
[18:43:00] indeed, it should be ready in a minute
[19:03:48] FIRING: PuppetFailure: Puppet has failed on ms-be1088:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
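The version-ordering points made in the packaging discussion above can be checked with a small comparison function. This is a simplified sketch of the Debian version ordering rules (epoch, then upstream version, then revision; `~` sorts before everything, including the end of the string; letters sort before other punctuation). It is not dpkg's own code, the function names are illustrative, and on a Debian host `dpkg --compare-versions` remains the authoritative check.

```python
DIGITS = "0123456789"


def _weight(ch):
    """Sort weight of one character in a non-digit run ("" = end of run)."""
    if ch == "~":
        return -1            # tilde sorts before everything, even the end
    if ch == "":
        return 0
    if ch.isalpha():
        return ord(ch)       # letters sort before other punctuation
    return ord(ch) + 256


def _compare_fragment(a, b):
    """Compare two upstream or revision strings; returns -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Step 1: compare the leading non-digit runs character by character.
        while (i < len(a) and a[i] not in DIGITS) or (j < len(b) and b[j] not in DIGITS):
            ca = a[i] if i < len(a) and a[i] not in DIGITS else ""
            cb = b[j] if j < len(b) and b[j] not in DIGITS else ""
            if _weight(ca) != _weight(cb):
                return -1 if _weight(ca) < _weight(cb) else 1
            i += 1
            j += 1
        # Step 2: compare the following digit runs numerically (empty = 0).
        si, sj = i, j
        while i < len(a) and a[i] in DIGITS:
            i += 1
        while j < len(b) and b[j] in DIGITS:
            j += 1
        na, nb = int(a[si:i] or 0), int(b[sj:j] or 0)
        if na != nb:
            return -1 if na < nb else 1
    return 0


def _split(version):
    """Split a version string into (epoch, upstream, revision)."""
    epoch, rest = version.split(":", 1) if ":" in version else ("0", version)
    upstream, revision = rest.rsplit("-", 1) if "-" in rest else (rest, "")
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return -1, 0 or 1 when a sorts before, equal to, or after b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)
```

Under these rules `compare_versions("0.12.1~wmf6", "0.12.1~wmf6+deb13u1")` is -1, which matches the error from gitlab-package-puller, and `compare_versions("0.12.2-wmf1~deb11u1", "0.12.2-wmf1")` is also -1, so a backport named that way sorts below the build it was backported from.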