[04:51:09] <_joe_> Krinkle: there should be more stuff, but basically if people use the service template for nodejs services it takes care of 90% of the guidelines itself
[06:22:16] hey slyngs, are you the one with the cron -> systemd timer patches? if so you have a +1 on one patch (can be merged whenever) and a -1 on the other (missing description) for the dumps-related ones, no rush, in theory I am in a cross-team hackathon this week but in practice I am preparing a presentation for a local meetup for the Wikimedia hackathon the following week :-)
[06:22:29] wow that message was too long, if it got truncated, tell me where...
[06:23:29] Yes, I saw the comment in Gerrit. I'll get the missing description sorted.
[06:23:53] 👍
[11:09:44] Can someone tell me what stupid thing I'm doing wrong, please? Tried to use the reboot-single cookbook thus 'sudo cookbook sre.hosts.reboot-single -t T307668 --depool ms-fe1010.eqiad.wmnet'
[11:10:01] and got the error 'phabricator.APIError: ERR-CONDUIT-CORE: Monogram "T307668" does not identify a valid object.'
[11:10:25] Oh, is this that I can't use a security phab item in a reboot note?
[11:10:33] that's a private task
[11:11:14] You can just add the task in the reason string instead
[11:14:42] Emperor: yes, the api key used by spicerack doesn't have permission to that task
[13:48:46] herron: Just saw your message regarding the mx servers, I think we should consider downgrading the kernel to see if that resolves the issue
[13:49:04] jhathaway: +1 sgtm
[13:50:49] herron: okay, working on that nowish
[13:51:42] great thx, I'll also create an incident doc now to capture details since this seems to have some user impact
[13:51:52] herron: thanks
[13:55:33] https://docs.google.com/document/d/18DuwBH9Ejsej6ENyCZE723MGbtgaa80vDTKTz-7J858/edit#
[13:58:13] herron: doesn't seem to have resolved the issue, still seeing them.
[13:59:26] hmm, I wonder if this is an ipv6 issue
[13:59:59] jhathaway: ack
[14:03:11] jhathaway: +1 on rebooting back to the old kernel
[14:07:36] moritzm: is this the expected kernel version to revert back to? mx1001 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64
[14:09:30] herron: I am pretty sure that was the prior kernel we were running
[14:22:32] herron: yeah, pretty sure that's what we've been running before
[14:23:02] there was also 5.10.103-1 released before
[14:23:16] but not sure if we track the old running kernel version in the metrics
[14:23:31] but if this still shows up with 106 we can still revisit 103
[14:24:26] moritzm: ok, jhathaway did revert to 106 and we are still seeing the issue, +1 to revisiting 103 here
[14:24:52] moritzm: we do -- https://w.wiki/598C
[14:25:48] cdanis: thanks
[14:26:51] thanks Chris, that means we moved from 5.10.92 -> 5.10.113
[14:27:40] going to reboot again..
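
A minimal sketch of the kernel check discussed above, assuming a stock Debian host; the commands below are generic Debian tooling rather than anything quoted from the channel:

```
# Confirm the currently running kernel and which older images are still
# installed, before rebooting into a previous one.
uname -r                                        # e.g. 5.10.0-13-amd64
dpkg -l 'linux-image-5.10*' | awk '/^ii/ {print $2, $3}'
```
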
[14:28:03] fingers crossed
[14:28:06] that dashboard should maybe report the combination of kernel ABI and version (or just the version), 5.10.0-11 can be ambiguous if there are two kernel releases which share the same ABI
[14:28:13] to get 5.10.92, though I think the issue is elsewhere
[14:28:23] the ABI changes pretty often, so it's not a big deal
[14:28:39] but using the version might be better in general
[14:29:12] seems like our MXes are finding all the 5.10.x regressions :-) (if the current one confirms)
[14:29:35] if the rollback doesn't help I wonder if adding a chunking_advertise_hosts hack in exim would help stop the immediate errors
[14:30:26] if 5.10.92 fixes it we can do a poor man's bisect by trying 5.10.103 and 5.10.106 which were also released in between
[14:30:35] yeah
[14:30:59] herron: were you able to tell if *new* mail since the reboots was affected?
[14:31:21] or only queued mail
[14:31:29] jhathaway: I've been tailing the log and seeing matches against "BDAT command used when CHUNKING not advertised" on both mxes, yeah
[14:31:53] since these are 5xx I don't think we're dealing with queued mail from the remote end
[14:32:12] exim just drops on 5xx?
[14:32:40] afaict these are pre-queue errors, so users would be getting a mailer-daemon bounce
[14:32:52] makes sense
[14:41:50] Hello team, I have a question regarding the accepted algorithms for PGP keys. The documentation suggests creating a 4096 bit RSA key; I currently have an offline ECC main key and I create subkeys for my computers.
[14:41:52] I can confirm, for the mail from the original report I have in fact gotten a bounce
[14:42:08] I added my wikimedia identity to my key and created a subkey for my Wikimedia laptop. Does the keyring support ECC keys?
[14:42:15] with a bounce like https://phabricator.wikimedia.org/P27766
[14:42:19] In case the keyring only supports RSA it may be possible for me to create an RSA subkey certified by my ECC key, please let me know if I should use an RSA subkey or if we can try the ECC key. :)
[14:42:41] Here's my public key: https://keyserver.ubuntu.com/pks/lookup?op=vindex&fingerprint=on&exact=on&search=0x3DFF9745BCDE5F4FF3AABE0E13DD4991DD98B648
[14:43:30] jhathaway moritzm still seeing errors after the downgrade to 5.10.0-11-amd64 #1 SMP Debian 5.10.92-2 (2022-02-28) x86_64 :(
[14:43:41] yeah
[14:43:54] what do you think about next steps?
[14:44:30] am I reading the logs correctly, that these are associated with the syntax error messages from exim?
[14:44:39] and only coming in via ipv6?
[14:45:06] jhathaway: seeing ipv6 only as well, and also looks like only google ips? I'm doing reverse lookups now
[14:46:50] jhathaway: added that to the incident doc, yeah, looks like all gmail hosts
[14:46:57] maybe this is caused by a change at Google? we could reach out to ITS so that they loop in Gsuite support?
[14:47:07] could be, yeah, good idea
[14:47:10] moritzm: yeah I think that would be wise
[14:47:27] I'm wondering if we typically receive mail from google via ipv6
[14:47:57] I also wonder if listing these hosts as chunking_advertise_hosts would help stop the immediate issue?
[14:48:06] not sure off hand about side effects on that
[14:48:41] have only been skimming, however we do normally get google mail over IPv6
[14:49:13] jbond: thanks
[14:50:35] herron: not sure about chunking_advertise_hosts, reading the docs now, wouldn't that be for sending mail?
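
On the PGP key question above, a hedged sketch of adding an RSA subkey under an existing ECC primary key, in case the keyring turns out to require RSA; the fingerprint placeholder and the two-year expiry are illustrative, not taken from the log:

```
# Hypothetical example: add a 4096-bit RSA signing subkey to the existing
# (offline) ECC primary key, then export the updated public key for the keyring.
# <FPR> stands for the primary key fingerprint.
gpg --quick-add-key <FPR> rsa4096 sign 2y
gpg --armor --export <FPR> > pubkey-with-rsa-subkey.asc
```
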
[14:52:24] afaict it's advertising the CHUNKING capability outwards during the SMTP session, which I'm curious whether it would pacify these "BDAT command used when CHUNKING not advertised" errors
[15:00:33] herron: i found this which suggests that this error could be elicited by "the BDAT verb (after MAIL and RCPT)":
[15:00:41] On safe Exim, it should yield:
[15:00:42] 503 BDAT command used when CHUNKING not advertised
[15:00:53] https://seclists.org/oss-sec/2017/q4/324
[15:01:38] I'm not 100% sure as it could be read different ways, but potentially if we see the BDAT after MAIL or RCPT then we may send this error; it could also be a massive rabbit hole
[15:01:46] chunking_advertise_hosts defaults to '*', and thus should be enabled for all hosts connecting to us
[15:02:07] yes, we definitely advertise it to everything
[15:02:08] you can check by manually connecting to mx1001, issuing an EHLO and seeing if you get back CHUNKING in the listed capabilities
[15:02:44] in that sense the "when CHUNKING not advertised" seems weird and could perhaps point to a broken smtp conversation
[15:02:52] like not issuing or misparsing the EHLO
[15:04:39] thought I had already added this to the ticket but must have missed it. either way here is the list of capabilities https://phabricator.wikimedia.org/T307873#7914136
[15:05:23] paravoid: thanks, confirmed that we do advertise it correctly
[15:10:56] herron: from the logs it appears these messages started on 5/4?
[15:11:09] 2022-05-04 01:28 was the first message
[15:11:15] but I don't see any prior to that
[15:11:51] so, to re-state the current status -- we continue to see "BDAT command used when CHUNKING not advertised" errors logged after reverting to the last known good kernel version. we are only seeing this error affecting google clients, which are connecting via ipv6. and we have confirmed that the chunking capability is indeed advertised after EHLO via v4/v6 manual SMTP connections
[15:12:18] +1 herron
[15:12:26] perhaps add that not all gmail is affected
[15:12:36] jhathaway: yes, please double-check me on that though
[15:13:37] actually I am not sure, it could be all gmail
[15:13:58] well anecdotally I just received a test email from my personal account, so not all gmail
[15:13:59] are you sure that exim was reloaded when the last exim changes were pushed?
[15:14:32] paravoid: pretty confident, but that is a good question
[15:15:52] last change was around tainted data from aliases, and I reloaded to confirm we no longer received those messages in the logs
[15:17:06] FYI see last response, I have been able to recreate the error over telnet
[15:17:13] https://phabricator.wikimedia.org/T307873#7914201
[15:17:56] jbond: that is super helpful
[15:18:03] jbond: yeah, no EHLO == no capabilities advertised -- that's what I meant with 18:02 < paravoid> like not issuing or misparsing the EHLO
[15:18:06] sorry for not being more clear
[15:18:33] ahh ok, yes it does work if I start with EHLO
[15:20:15] there is usually some delay in exim emitting its banner, I wonder if the remote party sends EHLO before our banner and exim ignores it
[15:20:42] that would mean it's a race, and could explain this being intermittent
[15:20:54] it's a little far-fetched though
[15:21:57] do current SMTP standards dictate who speaks first? it seems like any such race would be a bug if so.
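
A rough illustration of the manual capability check and telnet reproduction described above: connect to an MX, send EHLO, and look for CHUNKING among the advertised extensions. The host name assumes a .wikimedia.org domain for mx1001, and the responses are abridged examples rather than a verbatim capture:

```
$ telnet mx1001.wikimedia.org 25
220 mx1001.wikimedia.org ESMTP Exim ...
EHLO client.example.org
250-mx1001.wikimedia.org Hello client.example.org
250-SIZE ...
250-8BITMIME
250-PIPELINING
250-CHUNKING
250-STARTTLS
250 HELP
```
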
[15:22:43] ISTR the banner is meant to be first, BICBW
[15:23:02] we set "helo_try_verify_hosts = *" I see
[15:23:10] which is obsolete
[15:23:29] replaced by "verify = helo"
[15:23:35] paravoid: at least in manual testing, exim responds with chunking support even if I EHLO first
[15:23:39] before the banner
[15:23:45] https://datatracker.ietf.org/doc/html/rfc5321#section-3
[15:24:12] ^ does seem to imply a session starts with the server's initial output first
[15:25:29] hmm
[15:25:38] is this over TLS?
[15:26:02] paravoid: yes, they issue STARTTLS
[15:26:24] a typical smtp convo with tls should be receiver: banner -> sender: EHLO -> receiver: capabilities including TLS -> sender: STARTTLS -> receiver: banner -> sender: EHLO
[15:26:30] I wonder if the sender skips the second EHLO
[15:26:40] C=EHLO,STARTTLS,EHLO,MAIL,RCPT,BDAT,RSET,NOOP,MAIL,RCPT,BDAT
[15:26:55] that's an example command set for one of the failures
[15:27:23] and the last bdat fails?
[15:27:39] should we try disabling chunking?
[15:27:47] paravoid: yes I believe so
[15:28:24] weird
[15:28:29] jhathaway: sounds reasonable as a workaround
[15:29:14] +1
[15:29:41] +1 to try disabling chunking
[15:30:10] ok, I am going to disable puppet on mx1001, and make the change manually as a test, 'chunking_advertise_hosts ='
[15:31:36] +1 to disabling
[15:32:10] ok so
[15:32:27] EHLO,MAIL,RCPT,BDAT,RSET,NOOP,MAIL,RCPT,BDAT checks out
[15:32:33] RSET resets the session, including capabilities
[15:32:40] I just tried this sequence
[15:33:25] so the first EHLO is responded to with CHUNKING, and the first BDAT is accepted; the RSET in the middle resets, and there is no EHLO following it, so the second BDAT fails
[15:33:52] interesting, why would they issue an RSET
[15:34:01] that is a good question
[15:34:25] exim restarted on mx1001
[15:35:13] hm, I can't reproduce now on mx2001
[15:37:34] I am not seeing the log message on mx1001, going to disable chunking on mx2001 and restart?
[15:37:35] so scratch the above, I can't reproduce
[15:37:48] could be some intermittent state after the config change
[15:38:54] config change made on mx2001
[15:39:01] jhathaway: seeing the same, +1
[15:42:35] ok, I need to go
[15:42:39] call if you need any help
[15:43:15] thanks paravoid!
[15:45:52] indeed, thanks paravoid
[15:46:15] I think the immediate crisis is over herron
[15:46:18] still looking good here, no errors in the logs
[15:46:41] ha, was just typing the same, yeah, thoughts on decreasing the task priority and considering the incident mitigated?
[15:48:11] herron: +1
[15:57:43] {{done}}
[15:58:18] still wonder what the root cause was here, interested in what google postmaster says
[15:59:39] also looking back there are very few instances of this dating back to 2022-03-16, but the full scale didn't start until the 4th as you found
[16:00:43] herron: yeah I definitely agree that some more digging is needed, but I am happy that we have at least staunched the bleeding for the moment :)
[16:07:56] maybe the earlier cases from March are from a sampled test change on the Google side, which was eventually enabled globally on the 4th
[16:12:53] calling the incident mitigated sgtm
[16:13:11] don't forget to puppetize the change
[16:13:34] let me know if I can help in the postmortem follow-up
[16:14:10] thank you so much jhathaway, herron, jbond!
[16:14:17] paravoid: will do, thanks
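
For reference, the workaround applied above as it might later appear once puppetized; only the option itself is quoted in the log, the surrounding exim configuration file or template is not shown here:

```
# Exim main configuration: the default is chunking_advertise_hosts = *,
# i.e. advertise the CHUNKING/BDAT extension (RFC 3030) to every client.
# An empty host list stops advertising it, so senders fall back to the
# classic DATA command instead of BDAT.
chunking_advertise_hosts =
```
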