[08:37:37] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:39:07] that's the silence apropos T371192 expiring; I'll extend.
[08:39:07] T371192: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192
[09:13:57] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:58] ^-- known/expected?
[09:20:35] I think that is the host that keeps crashing, but was depooled (it is a spare)
[09:27:28] it is expected, tracked under T375382, haven't had the time to get to it yet
[09:27:29] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[09:27:56] I'll disable its notifications so it doesn't spam
[09:29:01] ah, it is already with notifications disabled
[09:29:40] ah mybad
[09:29:58] I'll reopen an issue as this one was closed, I think it slipped through during the pc incident
[09:30:37] nope: T373037#10246220
[09:30:38] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[12:14:09] Amir1: do you have a way/ticket/doc or something where I can see and possibly try to reproduce the auto inc issue, I want to report it to mariadb
[12:14:35] marostegui: he is in a meeting with me
[12:14:38] will tell him
[12:14:43] :)
[12:15:28] I guess it is all at https://phabricator.wikimedia.org/T375652 but I want to make sure that's all
[12:16:47] And also to confirm it happens on 10.6.19 and NOT in 10.6.17
[12:19:37] marostegui: yup, that's the only ticket and reclone on .17 fixed the "1" value on those tables
[12:23:33] Amir1: thanks :)
[12:28:30] Amir1: So to see if I am trying to reproduce this correctly: https://phabricator.wikimedia.org/P70924 and is that what was going back to 0?
[12:29:36] can I get a quick review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087443 ?
[12:32:26] dhinus: it looks good to me, but test it when pushed
[12:32:39] yep!
[12:32:41] thanks
[12:39:10] applied to an-redacteddb1001 and tested: it works as expected. applying to clouddbs now
[12:42:07] applied everywhere, works as expected
[12:47:21] Amir1: those pastes about cu_log are totally random, I generated them myself :)
[12:47:37] ah, got me worried :D
[12:48:14] hahaha
[15:31:27] Amir1: do you know if any part of the wikitech migration should make 2fa stop working? is that expected?
[15:32:56] like, if my password is accepted, but authenticator code is not (and I've changed nothing there), is that a migration-related issue, or breakage of another variety?
[15:36:37] btullis: are these `cassandra_hosts` needed for something? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/aqs-http-gateway/values-test.yaml#20
[15:37:54] they're wrong (they use the host's name, not the instance names); do they need to be fixed, or can they be omitted?
[15:41:35] urandom: I haven't looked at this in a while, but I have a feeling that they may still be referenced in the chart: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/aqs-http-gateway/templates/_config.yaml#L28
[15:44:52] yeah, I was thinking that a value for `cassandra_hosts` (a working list) must be getting plugged in elsewhere
[15:45:21] urandom: I think that you might do better to speak to someone on the Data Products team about this, since they are responsible for the app itself. There was some recent work done on T366157 where they removed a redundant config entry from the druid part of it.
[15:45:22] T366157: aqs-http-gateway kubernetes chart improvements - https://phabricator.wikimedia.org/T366157
[15:46:06] Possibly sfaci (Santi) would be able to help?
[15:47:21] btullis: sorry, I only asked you because you were the person that added these; I'll ping someone in data products
[15:50:22] Apologies, I didn't mean to be unhelpful or defeatist. I know that we also did some work to update the network policies associated with these aqs apps so that we can remove the hard-coded servers from those parts. e.g. https://github.com/wikimedia/operations-deployment-charts/commit/939fa2fed36d04bce0948e41a600d75bab74e877#diff-d3d8fa12d850154c33c86d9ff51f8deb1bfccb2598461161517854d0c86ad1ea
[15:51:09] However, I'm less familiar with the go based service itself.
[15:52:41] hrmm... so you think these may serve some other purpose than connecting to Cassandra?
[15:53:13] I'll open a ticket.
[15:53:19] btullis: thanks again
[15:56:33] I think that they serve a purpose other than opening the firewall to cassandra, so it might be the actual application configuration. Not sure though. Perhaps there is a new alias that could be used for service discovery for this cluster. Let's check with brouberol:
[15:58:26] I would hope that the only reference to a host running Cassandra (in an application's config) would be for the interface that is actually running Cassandra. Anything else would seem like a bug (even if of a different kind).
[15:59:35] i.e. I hope that's not happening, but I'll open a ticket for further followup
[16:03:35] urandom: can you try the 2fa for SUL account?
[16:03:55] Amir1: to be clear, like say on meta?
[16:04:09] yeah
[16:04:54] yeah, that works
[16:05:29] > hrmm... so you think these may serve some other purpose than connecting to Cassandra?
[16:05:29] these are references to Service resources allowing to configure network policies, cf https://wikitech.wikimedia.org/wiki/Kubernetes/Deployment_Charts#Select_what_services_you_want_to_enable_egress_to
[16:06:06] and because these are Services, you can use them for service discovery, for ex kafka-jumbo-eqiad.external-services.svc.cluster.local
[16:09:09] brouberol: to be clear, we're talking about this? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/aqs-http-gateway/values-test.yaml#16
[16:10:49] marostegui: too late your time I think but yeah, that should have caused the issue...
[16:10:52] urandom: ah, no, apologies, I thought you were talking about https://github.com/wikimedia/operations-deployment-charts/commit/939fa2fed36d04bce0948e41a600d75bab74e877#diff-c708f318d78fa6cad92bc81ac53c45cebacd9d59a2fb731f6e5e70bf42931de2R30
[16:11:11] the link that btullis shared. If not, forget about my comment
[16:12:05] Amir1: so I'm unable to reproduce
[16:55:56] Amir1: I don't think I ever visited Special:PasswordReset (though I did link my account), and I can't do it now because I can't get past the 2fa, so maybe I messed this up?
[16:56:10] probably should have had you migrate me when you offered
[16:57:23] * urandom will follow up on T376267 tho
[16:58:18] if you put it in phabricator, I will take care of it