[05:28:03] andrewbogott: did you create the task in the end? I don't see it on my mail :)
[05:36:34] marostegui: I tried again a few minutes later and it worked -- probably I just didn't wait long enough after depooling for existing queries to finish.
[05:36:49] ah cool! happy to hear :)
[05:37:45] Thanks for following up! I'm about to go to sleep but will see you later for the M5 proxy thing
[05:37:45] M5: Where should the Wikimedia usernames appear - https://phabricator.wikimedia.org/M5
[05:38:26] oh yes, thanks andrewbogott, sleep tight!
[08:49:36] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service
[10:00:37] :(
[13:24:13] anyone fancy reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/737913 apropos T294380 ? This is step one of https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_new_swift_account_(Thanos_Cluster) and go.dog is on holiday this week
[13:24:13] T294380: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380
[13:24:48] kormat: So the change that is up wouldn't work?
[13:26:32] I will do the m5 proxy change again in like 30 minutes
[13:31:56] I go eat something before the m5 proxy change
[13:32:49] marostegui: the existing change will work, it's just ugly
[13:34:18] kormat: But not having a page is even uglier!
[13:34:27] kormat: and the for DC switch, would it be a mess?
[13:35:43] I'm waiting for _joe_ to shout at me in #-sre when he has a chance
[13:35:51] XD
[13:56:20] * _joe_ yells at kormat
[13:58:17] see??
[14:00:02] andrewbogott bd808 legoktm let's go for the m5 breakage? :)
[14:00:13] I'm here
[14:00:32] Give me a minute
[14:00:46] sure!
[14:00:47] * bd808 arrives
[14:00:57] o/
[14:02:29] marostegui: I'm ready!
[14:02:35] ok!
[14:02:38] going to merge the dns change
[14:02:51] deploying
[14:04:10] deployed
[14:04:43] toolhub seems to be working as hoped
[14:05:14] legoktm: how's mailman doing?
[14:05:33] o/
[14:05:36] My home internet just dropped
[14:05:40] striker (toolsadmin) is working as well
[14:05:49] sweeet
[14:06:01] legoktm: for what is worth lists.wikimedia.org seems to be working
[14:06:03] marostegui: mailman looks fine
[14:06:07] yeah, lgtm, I can edit things on wikitech
[14:06:24] andrewbogott: didn't we move wikitech to s6?
[14:06:30] yes
[14:06:35] oh you're right :) old habits
[14:06:37] haha
[14:06:40] I didn't get to restart it, so it's probably still connected to db1128 directly?
[14:06:43] you scared me
[14:06:46] but anyway, also striker is working
[14:06:50] I can restart it
[14:06:52] legoktm: let's try a restart if you can
[14:07:29] legoktm: andrewbogott bd808 if for whatever reason during the EU night you see something not working fine and could be related all you have to do is revert https://gerrit.wikimedia.org/r/c/operations/dns/+/737837 and deploy the dns
[14:07:40] sounds good
[14:07:40] restarting
[14:07:42] thank you marostegui !
[14:07:45] I just restarted
[14:08:08] I can browse it fine so far
[14:08:17] hyperkitty seems happy
[14:08:35] \o/
[14:08:36] bd808: hyperkitty is always happy, she's "hyper"
[14:08:40] * bd808 now wants to name something lazydog
[14:08:44] hahaha
[14:09:00] or slowturtle
[14:09:04] bd808: rename mediawiki?
[14:09:23] * bd808 lobs a trout towards Amir1
[14:09:30] :P
[14:09:35] lgtm :)
[14:09:54] more of a "terrible dinosaur" thou
[14:10:08] Amir1: I still have not sent you stickers. Ugh. Sorry. I'll try to fix that soonish
[14:10:18] So I think we can call this resolved?
[14:10:26] haha, all good
[14:10:30] I am going to also comment on the task how to revert this in case it's needed
[14:10:40] marostegui: let me try a "write" operation in mailman
[14:10:46] Amir1: sounds good
[14:12:22] thank you for the work on this marostegui. And for playing the "find a day in November that is not a US holiday" dance too. :)
[14:12:35] hahahah
[14:12:41] thank you all for waking up early to test
[14:12:43] much appreciated
[14:14:08] I'm still trying to see the hyperkitty archives being written, they are not
[14:14:19] https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/
[14:15:04] Amir1: if you know the table I can directly check on the DB
[14:15:19] it's a mess backend, it's django
[14:15:30] oh god
[14:15:46] ORM
[14:17:29] legoktm: did you get emails in test@?
[14:17:58] not yet
[14:18:26] I've sent two by now
[14:18:36] I have sent one to our sre-data-persistence list
[14:18:47] isn't it google groups?
[14:18:51] ^^
[14:18:52] is it?
[14:19:02] yup
[14:19:06] :(
[14:20:01] The last right on any mailman3 table was at 14:15
[14:20:11] *write
[14:20:38] https://phabricator.wikimedia.org/P17717
[14:21:23] hmm
[14:21:35] I emailed test@ too, and it hasn't been received by mailman yet
[14:22:01] is there a log or something where we can check?
[14:22:11] I checked, there is no error AFAICS
[14:22:21] root@lists1001:/var/log/mailman3/web
[14:22:57] now we got something WARNING 2021-11-10 14:22:23,055 572 django.request Forbidden: /hyperkitty/list/test@lists.wikimedia.org/
[14:23:07] That's me
[14:23:13] Testing the browser
[14:24:28] https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/thread/4B5K36HMOQOCDY3H5RTX7C3JYUADRFCV/
[14:24:31] now it's there
[14:24:35] super slow though
[14:24:58] I just got a few emails from ops@ that are dated 14:23
[14:25:43] got it now too
[14:27:01] it started accepting messages at Nov 10 14:23:17 2021
[14:27:13] maybe it was just busy doing other things?
[14:27:14] but we restarted mailman a lot earlier than that
[14:27:33] legoktm: let me send another email
[14:28:20] nope, nothing yet
[14:28:43] but how can it be related to the DB change?
[14:28:51] it either works or doesn't
[14:29:03] yeah, it can be another issue
[14:29:13] https://lists.wikimedia.org/hyperkitty/list/test@lists.wikimedia.org/thread/EWBD4TNWRSH7XKWT2C5VX6T2J5TIE35R/
[14:29:23] It's a bit slow but faster than it used to be
[14:29:51] marostegui: I think it was unrelated
[14:30:40] Yeah, I am trying to compare grants and they do look the same
[14:30:42] sorry for panicking :D
[14:31:32] I think it's unrelated nor urgent
[14:31:47] *and not urgent
[14:32:01] Amir: Butchering English language since god knows when
[14:32:37] I have left the instructions to revert at https://phabricator.wikimedia.org/T288093#7495852 in case it is needed
[14:34:32] btullis: I have ack'ed the alert for db1108
[14:34:41] [15:34:29] <+icinga-wm> ACKNOWLEDGEMENT - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb@analytics_meta.service Marostegui T295312 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:41] T295312: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312
[14:34:42] that ^
[14:35:14] marostegui: Thanks. I thought I had downtimed them all, but I missed one.
[14:36:34] no problem!
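
A minimal sketch of the kind of "last write" check discussed around 14:20 above; the actual query behind the P17717 paste is not reproduced in the log, so this is only an illustration. It assumes the mailman3 tables live in a schema literally named `mailman3` (the real schema name on m5 may differ), and note that for InnoDB the UPDATE_TIME value is held in memory only, so it resets after a server restart.

    -- Hypothetical check: most recently written tables in the (assumed) mailman3 schema.
    -- UPDATE_TIME for InnoDB tables is not persisted across restarts.
    SELECT TABLE_NAME, UPDATE_TIME
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'mailman3'
      AND UPDATE_TIME IS NOT NULL
    ORDER BY UPDATE_TIME DESC
    LIMIT 10;
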
[14:45:22] marostegui: just confirmed, NDA group doesn't have access to orchestrator https://usercontent.irccloud-cdn.com/file/tVeVqyUw/image.png
[14:45:37] sweet thanks for checking
[14:46:00] so we'd need to work out with kormat what's the status and what's pending to give NDA people read access
[14:46:53] it seems to be limited to ops and sre-admins only, so wmf does not have access too
[14:51:45] https://gerrit.wikimedia.org/r/c/operations/puppet/+/737926
[14:51:55] I'll ask kormat once she's back
[15:08:06] who says there's a status?
[15:24:46] Amir1: how did you run into that error? you're in ops, no?
[15:25:01] kormat: someone else
[15:25:11] spy!
[15:25:17] :D
[15:27:19] I can also run into that error if it makes your life easier
[15:27:51] haha
[15:28:53] Amir1: you only have read access at the moment right?
[15:29:03] yeah yeah
[15:29:14] I think so
[15:29:50] oki
[15:30:24] just making sure people get the same as you got and not write one haha
[15:31:22] marostegui: write access is controlled by the orchestrator.json itself
[15:34:46] (which right now only mentions you and i)
[15:39:02] marostegui: there's absolutely no way that mw could be depending on m5, right? i'm just looking at T295478, and feeling a bit nervous
[15:39:03] T295478: MediaSearch returns "Invalid search" for any query - https://phabricator.wikimedia.org/T295478
[15:40:40] -search says it's an elasticsearch issue
[15:41:09] 👍
[15:41:41] btullis: hey. when you restored the backup to db1108, did you restore the gtid pos?
[15:42:04] kormat: https://phabricator.wikimedia.org/T288093#7461416 those are the DBs we have in m5, I don't think it is related to that in anyway
[15:42:28] marostegui: ack. i was looking at them, couldn't see any way for it to be related either. just thought i'd ask :)
[15:42:33] kormat: No, I don't think so. I have a record of it. Just looking up how to do it now.
[15:43:15] btullis: ah ok. that would explain the replication error
[15:46:18] kormat: I do a `SET GLOBAL gtid_slave_pos = "0-171971944-460977441"` (given that that's the value in `xtrabackup_info` and then start slave again, is that right?
[15:46:39] you can try it, and see what happens :)
[15:47:00] there's a chance you might need to do 'reset slave all', and reconfigure replication again
[15:47:47] I think it's OK. 🤞 :-)
[15:47:48] (we don't use gtid that much as it's very unclear how mariadb actually implemented it)
[15:47:53] https://www.irccloud.com/pastebin/tHakT76N/
[15:48:19] btullis: if it stays in that state for more than a few seconds, you're almost certainly in the clear
[15:50:12] kormat: Looking good. Thanks again.
[15:50:20] btullis: \o/. you're welcome :)
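
A hedged sketch of the GTID restore sequence btullis and kormat walk through between 15:41 and 15:48 above, assuming the replica was rebuilt from an xtrabackup/mariabackup snapshot and the position comes from `xtrabackup_info`. The master host and whether `RESET SLAVE ALL` is actually needed are situational guesses, not a canonical runbook.

    -- Restore the GTID position recorded by the backup, then restart replication.
    -- '0-171971944-460977441' is the value quoted in the conversation above.
    STOP SLAVE;
    SET GLOBAL gtid_slave_pos = '0-171971944-460977441';
    START SLAVE;
    -- If the slave refuses to start cleanly, the fallback kormat mentions is to
    -- drop the old replication state and reconfigure from scratch. The master
    -- host below is an assumption based on T295312, and credentials are omitted:
    --   RESET SLAVE ALL;
    --   CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet',
    --     MASTER_USE_GTID=slave_pos;
    --   START SLAVE;
    -- Then confirm Slave_IO_Running / Slave_SQL_Running stay at Yes:
    SHOW SLAVE STATUS\G
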