[08:03:46] <jynus>	 apparently the s5 replication delays also hit codfw sanitarium
[08:04:03] <jynus>	 I wonder if it was just a new wiki being setup?
[08:14:42] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 6.681e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[08:16:46] <jynus>	 ^that's me
[08:16:52] <jynus>	 it should catch up soon
[08:17:19] <jynus>	 it is the "we have 2 places to downtime" issue
[08:30:30] <Emperor>	 ms-be2028 has full `/`, looks like a swift drive didn't get mounted so we filled the rootfs instead. There's notes in swift/Howto on fixing this...
[08:38:22] <Amir1>	 dropping the old templatelinks columns of cebwiki
[08:39:00] <Amir1>	 Every night before sleep I was dreaming about dropping old columns of templatelinks in cebwiki, that's what gets me through tough days :P
[08:50:54] <Amir1>	 jynus: ^ this will drop a couple of hundreds of GB from backups
[08:51:05] <Amir1>	 you might get an error or something
[08:53:03] <jynus>	 yesterday I got:
[08:53:29] <jynus>	 Last snapshot for s4 at eqiad (db1150) taken on 2022-08-09 21:29:38 is 1702 GiB, but the previous one was 1830 GiB, a change of -7.0 %
[08:54:44] <jynus>	 but it could be just related to db source changes
[08:54:55] <jynus>	 or optimization due to table rebuilding
[08:54:58] <Amir1>	 jynus: so that's only because of templatelink alter table causing an optimize on it. The actual drop for s4 will come in two week
[08:55:43] <Amir1>	 I have to do two alter tables, first one to make the old fields nullable (so I can stop writing to them) and the second one to drop them
[08:55:45] <jynus>	 I am shutting down early the backup hosts that finished their daily tasks
[08:55:58] <Amir1>	 I just started the first alter on s4 last week
[08:56:59] <Amir1>	 hence the alert you got due to optimize
[08:57:37] <Amir1>	 some also bigger context: lots of wikis are switching to lua which uses less number of templates, reducing the total number of rows but we never optimized them
[08:58:24] <jynus>	 how does lua causes to reduce the number of templates, because conditional arguments?
[08:59:20] <jynus>	 I've seen it happening on eswiki, my home wiki, but I didn't know it was a frequent thing among wikis
[08:59:52] <jynus>	 as the early days were the gold rush of creating templates
[09:05:13] <Amir1>	 yeah because lots of wikitext templates depend on lots of functionalities string trimming, etc and they create lower level templates for it but in lua you get them for free
[09:05:56] <Amir1>	 thanks for shutting dbprov
[09:05:57] <jynus>	 btullis: I belive this should be normal, but just in case sending it to DE: Last dump for matomo at eqiad (db1108) taken on 2022-08-09 03:12:03 is 226 MiB, but the previous one was 239 MiB, a change of -5.4 %
[09:06:17] <Amir1>	 I start dealing with the D5 dbs 
[09:06:17] <stashbot>	 D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5
[09:06:42] <jynus>	 Amir1: I remember there used to be town templates for every region so they could put a different background- instead of adding an if!
[09:07:10] <Amir1>	 exactly, sigh
[09:07:35] <Amir1>	 it has got much better but still templates by nature quite heavily used everywhere
[09:08:09] <jynus>	 they are not evil, it depends on how they are being used
[09:10:53] <btullis>	 jynus: Thanks, I will check out matomo to see if there's any obvious cause. I'm not aware of any recent changes to the service, but it's possible.
[09:11:44] <jynus>	 btullis: let me help- the monitoring dashboard keeps a list of table sizes, so I can send a list of tables that got shrinked if necessary
[09:14:48] <jynus>	 forgot about data check- nowiki now ongoing (alphabetical order)
[09:15:03] <jynus>	 no data issue so far compared to eqiad
[09:18:38] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[09:22:15] <jynus>	 db1117 finally caught up
[09:32:58] <jynus>	 I will wait a bit before stopping db2101, while it is not in use, it is still replicating, so I will do it closer to the maintenance window
[09:33:39] <jynus>	 let me know if anyone needs help double checking downtimes, patches and hosts, etc.
[09:36:52] <Amir1>	 yeah let's stop and shut them down closer to the window
[10:00:37] <Amir1>	 jynus: I think I'm good for all dbs in D row except dbproxy2004, it's downtimed but I'm not sure what would happen if we shut it down
[10:01:09] <Amir1>	 manifest/site.pp says it's passive
[10:01:39] <Amir1>	 > I'm not sure what would happen if we shut it down
[10:01:45] <Amir1>	 Only one way to find out mwhaha
[10:13:27] <jynus>	 can you double check https://phabricator.wikimedia.org/T310146#8142200 ? check the host names, specifically
[10:14:04] <jynus>	 Amir1: I checked dbproxy2004 and it is passive- proxies are pooled by dns, and m5-master is not pointing to it 
[10:16:29] <Amir1>	 jynus: thanks. I meant db2181 and db2182
[10:16:53] <jynus>	 actually, it is not passive
[10:17:21] <jynus>	 m5-master.codfw.wmnet points to it
[10:17:31] <jynus>	 but I doubt it will have any actual traffic
[10:18:15] <Amir1>	 we have services that are active/active but not sure what are on m5, let me double check
[10:18:50] <jynus>	 Amir1: so you remove db2181 and db2182 because db2081 and db2082 were decommision recently? sounds strange?
[10:19:40] <Amir1>	 jynus: I have trouble following, yes. They have been removed from the rack to my knowledge 
[10:20:35] <Amir1>	 Am I missing something super obvious 😅
[10:20:57] <jynus>	 I see those setup on our dc: T306849
[10:20:57] <stashbot>	 T306849: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849
[10:21:17] <jynus>	 I just don't know what is the impact of db2081 on db2181
[10:21:24] <jynus>	 and same for the other pair
[10:21:42] <Amir1>	 ah, that's my mistake, I missed the 1 there
[10:26:26] <jynus>	 so, are you reverting your edit, then?
[10:27:05] <Amir1>	 yup
[10:27:32] <jynus>	 ok- I gave a clean list of dbs on the schedule description- if that helps with the work
[10:27:49] <jynus>	 it is much easier to get those from the dashboard
[10:30:48] <RhinosF1>	 Amir1: you conflicted elukey
[10:31:27] <btullis>	 jynus: Just to follow up, I haven't found anything amiss with matomo. I think we can probably ignore the change in size of the backup. But thanks for bringing it up. 👍
[10:32:32] <jynus>	 yeah, it is only a warning if it is between 5-15 % but wanted to give you a heads up
[10:33:40] <jynus>	 soon we should be able to setup individual percentages by backup- we can make that higher for matomo if needed to avoid alert spam
[10:35:07] <jynus>	 RhinosF1: sadly phab lacks conficting edits detection :-(
[10:35:08] <Amir1>	 RhinosF1: fixed
[10:35:28] <btullis>	 jynus: Ack, thanks. Not sure what matomo's long-term prospects are, but good to know.
[10:35:40] <RhinosF1>	 Amir1: thanks
[10:35:58] <RhinosF1>	 jynus: the lack of edit conflict checks is one of phab's worst features
[10:36:06] <RhinosF1>	 It's probably the easiest one to solve too
[10:36:22] <Amir1>	 I remember bugzilla would call it "mid-air collision"
[10:36:45] <RhinosF1>	 Add a timestamp to the edit url as a parameter, check that parameter when saving against the last edit in the DB
[10:37:30] <jynus>	 Amir1: apologies if adding stress- as I said I offer any help you need, but I follow Manuels recommendation that- if I see something that looks not intended, I speak up early
[10:37:55] <Amir1>	 jynus: stress now is much better than stress during the maint :D
[10:38:58] <Amir1>	 I downtimed them, I don't think they are in any section yet, so we can probably just reboot them
[10:39:04] <jynus>	 +1
[10:39:27] <Amir1>	 it doesn't even have mariadb installed so reboot it goes
[10:39:41] <Amir1>	 ugh not reboot, power down
[10:39:43] <jynus>	 it is super easy to get confused with so many numbers, that is why 2 pair of eyes see more than 1 :-D
[10:41:48] <jynus>	 remember I originally schedule the pdu maintenance backup work on the wrong month!
[10:42:09] <Amir1>	 haha nice
[10:42:19] <jynus>	 until manuel told me very confused
[10:42:31] <jynus>	 *scheduled
[10:43:02] <Amir1>	 my favorite mistake I have seen was that in one of mw schema change patches, the mssql alter added the column to the wrong table and it went unnoticed for two years (=four released)
[10:43:06] <Amir1>	 *releases
[10:43:53] <jynus>	 I can see it- not many msssql users, it is an addition, ...
[10:44:46] <Amir1>	 yeah, at least now it gets generated through abstract schema changes so you don't need to repeat alters three times (and adjust it for RDBMSes you don't know) and avoid mistakes like this
[10:46:32] <RhinosF1>	 The abstract schema work makes things much nicer to do
[10:47:26] <Amir1>	 yeah
[10:47:54] <Amir1>	 jynus: on m5 we have cx, labsdbaccounts, mailman3, striker, toolhub
[10:48:02] <Amir1>	 I know mailman doesn't work in codfw
[10:48:12] <jynus>	 yeah, cloud I think is all eqiad
[10:48:22] <jynus>	 what about mailman, too?
[10:48:33] <Amir1>	 mailman is not active/active 
[10:48:46] <jynus>	 I don't know about cx, but I think it is user normally in bursts
[10:48:47] <Amir1>	 I don't think it even has a codfw VM
[10:48:57] <jynus>	 *used
[10:49:12] <Amir1>	 if mw doesn't get read there, cx won't get it either I think
[10:49:16] <jynus>	 we can check active connections to the proxy
[10:49:47] <Amir1>	 ah yeah
[10:51:15] <jynus>	 interesting, there is actually some from kubernetes
[10:56:40] <jynus>	 I confirm it is cxserver through sys
[11:01:48] <Amir1>	 I ping Kartik 
[14:29:12] <jynus>	 let me know if you need any help re: pdu
[14:39:13] <jynus>	 acked codfw sanitarium s1 replication
[14:43:48] <Amir1>	 thanks
[14:44:07] <jynus>	 that way when it goes green it will go back to normal automatically
[14:44:18] <jynus>	 downtime isn't that great once the alert is off
[14:44:48] <jynus>	 I also mostly ack with time duration (24 hours in this case) to make sure it is not forgotten
[14:46:42] <Amir1>	 I'm not following, I will make sure every host is back online and getting replication once the pdu maint is over
[14:46:53] <Amir1>	 which would automatically resolve the alert
[14:47:38] <jynus>	 so if you downtime a host that it is alerting, it will alert again on recovery
[14:47:49] <jynus>	 so it is not super-useful
[14:48:07] <jynus>	 if you ack it, it will remove the ack automatically after going green
[14:48:35] <jynus>	 the downtime is kind of designed as a pre-alert workflow, the ack as a post-alert workflow)
[14:48:48] <Amir1>	 hmm, okay
[14:49:01] <jynus>	 now, if that makes sense or not, blame icinga :-D
[14:49:47] <jynus>	 the danger of acks is that you could forget it forever- so I normally use ack with a defined expiration time
[14:50:40] <jynus>	 in reallity there is not a big difference between both, downtime and ack means the alert is in handled state
[14:50:47] <Amir1>	 I hope we migrate them off icinga soon
[14:50:53] <jynus>	 yeah
[14:51:08] <jynus>	 although alertmanager will have its challenges too
[14:51:18] <jynus>	 but probably will be a net win
[14:58:16] <jynus>	 do you mind if I take care of starting back up db2101? I want to do it early to make sure backups are in a working state soon. 
[14:59:10] <jynus>	 (not yet, obviously)
[15:01:40] <Amir1>	 sure
[15:08:58] <jynus>	 cxserver codfw :-D
[15:09:51] <Amir1>	 meh, it's not like anyone is using it, the whole codfw is depooled :D
[15:10:14] <Amir1>	 I have been pinging people since this morning 
[15:10:39] <jynus>	 I know, that is why I was smiling
[15:11:19] <jynus>	 but to be honest, that dependency surprised me
[16:16:34] <jynus>	 taking db2101 myself, Amir1
[16:16:47] <Amir1>	 ack
[16:16:51] <Amir1>	 I'm doing the rest
[16:26:01] <jynus>	 db2101 looking good: https://grafana.wikimedia.org/goto/cySokSi4z
[16:36:37] <jynus>	 waiting now for D7 and backup2007
[16:36:37] <stashbot>	 D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7
[17:14:07] <Emperor>	 swift nodes in D.7 now back and un-downtimed, so I think that's me done for today
[17:47:10] <jynus>	 everything looking good on my side- if there is any maintenance left can be done tomorrow- bye!
[17:47:49] <Amir1>	 I brought back all the dbs and replication coming back
[17:48:08] <jynus>	 great work, Amir1
[17:48:20] <jynus>	 now have a deserved rest :-D
[17:48:28] <jynus>	 I know I am
[17:48:31] <Amir1>	 Thanks.Now I need to debug some deadlocks :P
[17:48:34] <Amir1>	 go rest!
[21:59:18] <RhinosF1>	 Amir1: that template links stat is insane
[21:59:32] <RhinosF1>	 I very much hope we get the same at Miraheze