[04:32:54] andrewbogott: What's up!
[05:00:40] Amir1: how long was s5's downtime for?
[05:00:54] codfw might alert? it has been delayed for 1d and 18h now XD
[07:23:40] marostegui: I downtimed it again today and yesterday (for a couple of days)
[07:23:46] ok!
[07:24:41] We have to combine the alters, otherwise it'll shut down replication to the cloud for a week
[07:24:55] It should finish soon though
[07:25:53] Amir1: In case you have some maintenance going on on s7, please stop it or make sure it finishes before tomorrow: https://phabricator.wikimedia.org/T313383
[07:26:34] Yeah. I'll be there tomorrow. I think I have one s5 thingy
[08:14:15] I'm still landing, so feel free to ping me for important stuff
[08:15:13] landing seems like a bad time to be pinged :)
[08:16:26] nah, you can do VFR and attend an outage at the same time, I was told :-)
[08:16:59] hahahaha
[08:17:08] welcome back jynus
[08:17:14] something's wrong here... did you two steal each other's nicks?
[08:18:15] line content and nick don't match :D
[08:19:44] good news is that I believe some backups failed overnight, but this was detected and corrected automatically
[10:14:58] s4 data size is 1.8TB now, including innodb compression
[10:25:52] marostegui: the s5 codfw replication will start within the next two hours
[10:26:14] jynus: is it bad? it should be fixed in a couple of weeks
[10:28:16] it is not bad per se, it's just that the typical recommended shard size is ~256GB for a time to recovery under 5 minutes
[10:30:38] 453G templatelinks.ibd
[10:30:54] It'll be back to 100GB at most
[10:43:48] I think I caught up on email, I may have some questions for you later, but those can wait
[10:44:00] you == a few people on this channel
[10:47:17] * Emperor hides
[10:47:20] :)
[10:56:20] I was going to ask you if you could migrate the ms hosts to ceph + bookworm by tomorrow, but I guess it can wait :-)
[11:16:35] lol
[11:23:15] I have failed over m3-master
[11:23:29] As part of the C5 switch crash preventive actions
[11:39:33] that line above would have been nicer with s/over m3-master// :-P
[13:21:21] jynus: I have stopped db2078's mysql today (the old host used for misc backups in codfw), we replaced it with db2160 last week. I plan to decommission it tomorrow, all looking good from your side since the change, right?
[13:21:36] yeah
[13:21:41] good!
[13:21:44] will we keep it around for some time
[13:21:47] just in case?
[13:21:55] as in, a week or so?
[13:22:18] So db2160 has been in place for a week already; looking at dbbackups.backups I can see it is being used fine, but if you prefer me to wait to decom db2078, I can do that too
[13:22:22] ah, you just said tomorrow
[13:22:43] if it is still up, can I do a one-time copy just in case
[13:22:45] ?
[13:22:51] It is, yeah
[13:23:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/811884 this is when we replaced it, 7th July
[13:23:38] I know I am being super careful, it just wouldn't take me much time to run a snapshot
[13:23:44] Sure, go for it
[13:23:48] and you can decom tomorrow as planned
[13:23:59] Sure, if you prefer me to wait till Friday, I can also do that, no problem
[13:24:01] what about eqiad
[13:24:11] do you have an ETA for that?
[13:24:16] for decom?
[13:24:17] No, that's not being touched
[13:24:40] not yet, or not ever?
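On the shard-size exchange above: the 453G figure was read straight from the templatelinks.ibd file on disk. A minimal sketch of how a table's footprint can also be estimated from information_schema on the MariaDB instance itself; the numbers are estimates and can be smaller than the on-disk file, which includes unreclaimed free space.

    -- Rough per-table size estimate; compare against the .ibd file size on disk.
    SELECT table_schema, table_name,
           ROUND((data_length + index_length) / POWER(1024, 3), 1) AS size_gib
    FROM information_schema.tables
    WHERE table_name = 'templatelinks';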
[13:24:50] I don't even know when db1117 expires :)
[13:24:54] ah
[13:24:55] ok
[13:24:55] I guess sometime this FY maybe
[13:24:57] But no idea :)
[13:24:58] so different date
[13:25:00] gotcha
[13:25:16] ok, so let me leave a one-time snapshot running
[13:25:31] Sure, no problem
[13:25:37] I left the backup snapshot running, so you will be able to see the progress
[13:25:47] *backup dashboard
[13:26:48] marostegui: 4 years since 1117 was racked
[13:26:54] https://phabricator.wikimedia.org/T191792
[13:26:59] So yeah, sometime this FY I guess
[13:34:26] marostegui: the s5 replication is flowing but 18h behind, can I run it on dbstore now?
[13:34:39] sure!
[13:34:47] awesome
[13:34:55] I am just waiting for the new wiki to replicate to the sanitarium codfw hosts
[13:35:00] So I can sanitize it there
[13:35:09] ERROR 1049 (42000): Unknown database 'blwiki'
[13:35:12] Still not there heh
[13:35:32] are you checking the sanitarium master?
[13:35:44] that'll take a couple of days though
[13:35:51] Riiiiight
[13:35:54] No rush anyways
[13:36:15] Worst case scenario we will have an email on Monday saying that there's private data on the codfw sanitarium hosts
[13:36:20] Which is "fine" as they are not in use
[13:36:24] I will check on Friday though
[13:36:28] Before leaving
[13:36:43] I can run it on Saturday/Sunday
[13:36:52] Nah, no problem
[13:36:55] wait, this weekend is Berlin Pride, nvm
[13:36:56] It is not a big deal at all
[13:37:01] Haha
[13:37:10] I will be drunk most of the weekend I assume
[13:42:04] hey all, I added the objectives we had discussed in the OKR meetings into BetterWorks, so you can now add your KRs aligning to them as discussed
[13:42:07] if in doubt, please talk to me :)
[13:43:33] I have one question for marostegui regarding OKRs: how are you seeing *that* objective you thought I could help with? do you think it could become a bigger thing with the data you have so far?
[13:43:51] jynus: Yeah, I don't know yet :(
[13:44:02] basically I want to organize myself and think of a worst-case situation
[13:44:05] Yeah
[13:44:19] I think worst case, we can try to see how it works in different versions (10.7 or even 10.8)
[13:44:32] But so far I don't think we need you for now other than brainstorms
[13:44:40] But hard to tell
[13:44:41] or... 8.X :-D ?
[13:45:10] haha
[13:45:14] I have actually thought about it
[13:47:39] question_mark: I created the mw objective earlier today so I could create the KRs, but now we have two "Improve stability and scalability of MediaWiki database usage" and I can't move the KR to under yours so I could delete it
[13:47:50] I can delete yours and make modifications
[13:47:59] that works too, if you can assign it to me
[13:48:16] Uff I forgot about BetterWorks!
[13:48:18] So much stuff going on
[13:48:32] no worries, I asked to do it by end of week ;)
[13:48:38] and I only just put in my objectives
[13:49:43] marostegui: it's blkwiki not blwiki
[13:49:50] question_mark: yup, done now, but I can't delete your empty objective now
[13:50:00] RhinosF1: yeah thanks, I noticed, it is not there anyways :)
[13:50:16] np
[13:52:56] Amir1: done
[13:53:13] thanks
[14:04:50] marostegui: sorry I missed your ping. I'm adding new grants to m5 for some new servers, the new grants are in this diff: https://gerrit.wikimedia.org/r/c/operations/puppet/+/815378/1/modules/profile/templates/mariadb/grants/production-m5.sql.erb I'm happy to just run those commands myself but I don't know if the process has changed... it's still all done by hand, right?
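For context on the grants request above: the real statements live in the linked production-m5.sql.erb puppet template and are applied by hand on the m5 master. The sketch below only illustrates the general shape of such a statement; the database, user, host range and password are hypothetical placeholders, not the actual values from that change.

    -- Hypothetical example only; the real grants are in the puppet template linked above.
    GRANT SELECT, INSERT, UPDATE, DELETE
        ON exampledb.* TO 'example_user'@'10.0.0.%' IDENTIFIED BY 'REDACTED';
    FLUSH PRIVILEGES;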
[14:05:02] (also if you're rushing about handling the bad router issue then this can definitely wait!)
[14:05:21] andrewbogott: Hello! I can do it, no worries
[14:05:22] ongoing transfer: https://grafana.wikimedia.org/goto/2ZBPP0RVk?orgId=1 https://grafana.wikimedia.org/goto/NhqmsAgVk?orgId=1
[14:05:28] Let me get to it now andrewbogott
[14:05:29] thanks!
[14:09:43] andrewbogott: Grants added. I have not merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/815378 as there's more stuff there apart from the sql file, but it can be merged anytime
[14:11:06] sounds good, thanks again
[14:38:39] marostegui: is it okay if I turn off innodb_flush_neighbors on a couple of pc dbs (and a couple of replicas in s6)? I'm sure all of pc is SSD and I'll double-check before enabling anything on s6
[14:38:52] Amir1: yep, that's fine!
[14:39:02] does it need a restart?
[14:39:06] Amir1: All s6 are SSDs
[14:39:12] Amir1: no, just set global innodb.... = 0;
[14:39:18] awesome
[14:39:53] from my queries I think all of es is on HDD, a lot are not accessible
[14:40:13] stuff like WARNING:root:Host db1123.eqiad.wmnet has unknown media type for drive ATA....
[14:40:33] es on HDD?
[14:40:37] I think we replaced all of them already
[14:41:03] let me send you the report
[14:41:31] for all the disks or just 2? if it's 2 per host it might be just the OS and/or some additional embedded one (not sure if the SD cards are reported by those commands I gave you)
[14:42:22] volans: I see it for all of the RAID I think https://phabricator.wikimedia.org/P31525
[14:42:36] Amir1: you are right, they are spinning disks, just check their purchase https://phabricator.wikimedia.org/T235820
[14:42:46] I think we did that to save money to have more space
[14:43:17] yep, confirmed, SATA by looking up their serial number from megacli
[14:43:33] 4TB is not much space :D
[14:43:48] I don't see db1123 in that paste :D
[14:44:11] volans: that's only for es
[14:44:17] all of it is long
[14:44:23] Amir1: it is 12x2TB disks :)
[14:44:58] ah sorry
[14:45:01] volans: https://phabricator.wikimedia.org/P31526
[14:45:09] all are protected by NDA
[14:45:55] thx
[14:45:56] marostegui: hmm, should I divide it by three (RAID?)
[14:46:19] just curious
[14:46:50] Amir1: No, it is RAID10, so by 2
[14:46:52] 11TB usable
[14:46:56] (more or less)
[14:47:01] cool
[15:12:49] I don't know how much it's related to our work or how much it's related to active/active work, but codfw requests are erroring because heartbeat can't find an s8 db on s1: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2022.07.20?id=1pMgHIIBWJDkYQXofH04
[15:14:57] That's weird
[15:15:00] that's a new host though
[15:16:00] But it is replicating in s8 and s1 (multi-instance)
[15:16:06] and it all seems fine there
[15:16:16] https://orchestrator.wikimedia.org/web/search?s=db2167
[15:16:18] No lag or anything
[15:16:22] So heartbeat is working as expected
[15:16:31] unless I pooled it in the wrong places
[15:16:32] let me check
[15:16:43] marostegui: it's pooled https://noc.wikimedia.org/dbconfig/codfw.json
[15:16:52] yeah, I know it is pooled
[15:16:56] I found the issue
[15:16:58] I mean both instances
[15:17:00] :D
[15:17:18] sigh
[15:17:18] Fixed!
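Expanding the truncated "set global innodb.... = 0;" from the innodb_flush_neighbors exchange above, a minimal sketch of the dynamic change, assuming the variable is at its default of 1. No mysqld restart is needed, but the setting does not survive a restart unless it is also set in the server configuration.

    -- Only sensible on SSD-backed hosts; flushing neighboring pages helps on
    -- spinning disks but wastes work on flash.
    SET GLOBAL innodb_flush_neighbors = 0;
    -- Verify the new value:
    SHOW GLOBAL VARIABLES LIKE 'innodb_flush_neighbors';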
[15:17:23] Thanks
[15:17:25] https://phabricator.wikimedia.org/P31531
[15:17:28] Thanks for the heads up
[15:17:33] So many things at the same time :(
[15:17:41] thank you for fixing it
[15:18:04] we should have something to avoid pooling stuff like this
[15:18:46] marostegui: errors are normal now \o/ https://logstash.wikimedia.org/goto/ede19894b5537e826318ec2c90646145
[15:23:52] <3
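On the heartbeat mixup above (an s8 instance pooled where an s1 one was expected): a quick sketch of the kind of check that shows which sections a host is actually emitting heartbeats for, assuming the WMF-patched pt-heartbeat schema with its shard and datacenter columns; those column names are an assumption, not confirmed in this log.

    -- Run on the suspect multi-instance host; each section's heartbeats should
    -- appear under its own shard value (e.g. s1 vs s8).
    SELECT shard, datacenter, MAX(ts) AS last_heartbeat
    FROM heartbeat.heartbeat
    GROUP BY shard, datacenter;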