[00:44:01] musikanimal: you may want to make the bot run the usernames through the username regex to ensure they are legal
[01:09:16] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:12:58] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:46:42] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 53 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[05:47:26] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 30.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[05:48:32] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[05:49:16] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[11:29:46] as a followup to the meeting yesterday, this is the error that misled me: "Wikimedia\Rdbms\LoadMonitor::computeServerStates: host {db_server} is unreachable"
[11:30:06] and this is the ticket that Chris mentioned on the other incident doc: T265386
[11:30:06] T265386: Make LoadMonitor server states more up-to-date and respond to outages more quickly - https://phabricator.wikimedia.org/T265386
[11:31:35] ^this is the one I heard about, and it may be related (I haven't checked) to the mw-level improvements Amir1 mentioned
[11:32:13] jynus: every time I hear LoadMonitor, I want to cry
[11:32:40] just passing along things other SREs mentioned, don't shoot the messenger :-D
[11:32:53] currently it's just useless; it was made to prevent incidents, makes a ton of connections and db queries, including cross-dc ones, and has prevented zero (0) outages
[11:33:16] the plan is to rewrite it fully
[11:33:18] Tim is on it
[11:33:19] I am of the opinion that load balancing should be its own dedicated service with more standardized tooling
[11:33:42] Tim's work: T314020
[11:33:42] T314020: LoadMonitor connection weighting reimagined - https://phabricator.wikimedia.org/T314020
[11:34:22] My opinion is that at least for CLI it should be like that (having a proxy in between for LB)
[11:34:57] well, also use standardized wmf configuration
[11:35:05] MW's rdbms library is trying to juggle two distinct worlds, long-running scripts and short web requests
[11:35:35] sure, there may be some customization needs, but reinventing algorithms that things like haproxy have very robustly implemented seems like the wrong direction to me
[11:35:50] haproxy or many other 3rd party tools
[11:36:14] yeah but let's not use sqlproxy, I do enough sql day and night :P
[11:36:20] proxysql
[11:36:53] yeah, my hope is to have maint scripts switch to haproxy maybe in a year or two, we will see
[11:37:11] and yes, something more custom could fit better, but in the end it is humans administering it, and not only the one developer who has to manage it (standardization helps onboarding)
[11:47:56] jynus: if you feel like it, can you write a doc outlining the benefits?
[11:49:53] sure, it is more for non-technical and practical reasons; basically the same idea as what k8s provides: standardization and microservices, which is more or less the general philosophy SREs (and WMF) are moving to
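The LoadMonitor behavior being debated above amounts to lag- and reachability-aware traffic weighting. A toy sketch of that idea follows (Python rather than MediaWiki's PHP; illustrative only, not WMF's actual implementation):

```python
# Toy illustration of lag/reachability-aware replica selection, the job
# LoadMonitor does inside MW and a proxy like haproxy would do externally.
# Not WMF code; the thresholds mirror the (W)1/(C)2 seconds in the alerts.
import random

def pick_replica(servers, max_lag=2.0):
    """servers: dicts like {"host": "db2160", "weight": 100,
    "lag": 0.3, "reachable": True}. Returns a host name."""
    pool = []
    for s in servers:
        if not s["reachable"] or s["lag"] >= max_lag:
            continue  # depool unreachable or critically lagged hosts
        # Scale the configured weight down as lag approaches the limit.
        pool.append((s["host"], s["weight"] * (1 - s["lag"] / max_lag)))
    if not pool:
        raise RuntimeError("no healthy replicas in the pool")
    hosts, weights = zip(*pool)
    return random.choices(hosts, weights=weights, k=1)[0]
```

A dedicated proxy layer would pull this logic out of every client, which is the standardization argument made above.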
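The "sustained replica lag" alerts at the top of this excerpt reduce to reading Seconds_Behind_Master and comparing it against those warning/critical thresholds. A hedged sketch, assuming the pymysql driver; the host, credentials, and the port (taken from the var-port in the Grafana links) are placeholders:

```python
# Minimal replica-lag probe: SHOW SLAVE STATUS and compare
# Seconds_Behind_Master against the (W)1 ge / (C)2 ge thresholds.
import pymysql

def replica_lag_seconds(host, port, user="monitor", password="secret"):
    conn = pymysql.connect(host=host, port=port, user=user,
                           password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on newer MariaDB
            row = cur.fetchone()
            # NULL (None) means replication is broken or not configured.
            return None if row is None else row["Seconds_Behind_Master"]
    finally:
        conn.close()

lag = replica_lag_seconds("db2160.codfw.wmnet", port=13321)
state = "CRITICAL" if lag is None or lag >= 2 else "WARNING" if lag >= 1 else "OK"
print(f"lag={lag} state={state}")
```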
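On musikanimal's point in the first message: rather than re-implementing the username regex inside the bot, one option is to let MediaWiki judge validity through its API, which flags syntactically illegal names. A rough sketch; the endpoint and the choice of the list=users check are assumptions:

```python
# Ask MediaWiki which of a batch of names are not legal usernames.
# With list=users, syntactically illegal names come back flagged
# "invalid" (merely "missing" means the account just doesn't exist).
import requests

API = "https://meta.wikimedia.org/w/api.php"  # assumed endpoint

def illegal_usernames(names):
    r = requests.get(API, params={
        "action": "query",
        "list": "users",
        # \x1f is the API's alternative separator, safe for names with "|"
        "ususers": "\x1f" + "\x1f".join(names),
        "format": "json",
        "formatversion": "2",
    }, timeout=10)
    r.raise_for_status()
    return [u["name"] for u in r.json()["query"]["users"] if u.get("invalid")]

print(illegal_usernames(["Example", "Foo#bar", "A|B"]))  # latter two are illegal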
[12:25:59] I am going to ack "Reduced availability for job mysql-labs in ops@codfw" because afaik it is just T326584
[12:26:00] T326584: Decomission db209[345] - https://phabricator.wikimedia.org/T326584
[12:26:17] jynus: mmm but db2093 is active?
[12:26:32] I just stopped mysql there
[12:26:35] I think it is just complaining about db2084
[12:26:39] *94
[12:26:48] but db2094 is sanitarium
[12:27:09] I think it is in the "labs" prometheus group (as it is not core)
[12:27:17] Ah then yeah
[12:27:24] Might be related then yeah
[12:27:29] but because it has very few instances
[12:27:37] it complains even if there is only 1 host down
[12:27:53] while for core a lot of hosts would have to fail
[12:30:59] alertmanager UI is great, but at the same time it is so well thought out compared to how barebones icinga is that it makes me suspicious
[12:31:22] I find alertmanager hard to read
[12:31:23] I've silenced it for 2 days linking to the task
[12:31:33] oh, I may have some tips
[12:31:44] I added some filters and now I prefer it over icinga
[12:31:53] which filters do you have?
[12:32:16] @state=active team=sre alertname!=Check systemd state
[12:33:04] and that gives me a much cleaner dashboard
[12:33:11] For me it is still soooo full
[12:33:19] I find it impossible to get the data
[12:33:30] ha ha ha but that is a wmf issue, not a ui one!
[12:33:36] I guess it is a matter of getting used to it
[12:33:48] No, I find it very hard to read
[12:33:54] So many different boxes, colors, etc
[12:33:55] I am not fond of the boxes
[12:34:15] they are not super readable, for example for finding the host
[12:56:16] with your permission, I will put db2184 down in the afternoon to test the 10.6 backup recovery again
[12:56:27] sounds good
[12:56:42] I also gave you some more homework before you leave on holidays https://gerrit.wikimedia.org/r/c/operations/puppet/+/892948
[12:56:43] I will rename the datadir just in case
[12:56:52] let me see
[12:57:24] looks fine to me, can it be merged now?
[12:57:32] sure, anytime
[12:57:35] Let me merge it
[12:57:41] But maybe you need to try a backup for it?
[12:57:55] sure, I think it takes 1 second to back that up :-D
[12:57:59] ok, merging!
[12:58:24] let me run it now
[12:58:26] merged
[13:01:52] I was wrong, it took 4 seconds: https://phabricator.wikimedia.org/T326596#8652464
[13:02:23] Can I also get a sanity check on https://gerrit.wikimedia.org/r/c/operations/puppet/+/892953
[13:02:32] jynus: that's 4 times what you predicted!!!
[13:02:57] marostegui: just the binlog of pc1012 has gone from 310GB to 240GB
[13:03:07] Amir1: that's awesome
[13:04:31] marostegui: the ip is fine, but I am not sure what those are used for anymore?
[13:04:42] jynus: Yeah, I was trying to figure that out now
[13:05:00] it could be a leftover predating cumin hosts
[13:05:11] jynus: They are used for orchestrator I believe
[13:05:16] or from the times of tendril
[13:05:27] but that would be dborch, right?
[13:05:39] jynus: yeah, but the DB is in db1115/db2093
[13:05:42] are they needed by the dbs of orchestrator?
[13:06:04] what I mean is, orch is at: 208.80.155.103 # dborch1001.wikimedia.org
[13:06:10] not so sure about the dbs
[13:06:11] yeah, that's the frontend
[13:06:15] But I think we run stuff from the db too
[13:06:32] that I don't know (I am not saying yes or no)
[13:08:02] (I wasn't involved in anything orch related)
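For reference, the dashboard filter quoted above ("@state=active team=sre alertname!=Check systemd state") can also be reproduced programmatically with the Alertmanager v2 API, e.g. for scripting; the host below is a placeholder:

```python
# List active, unsilenced alerts for team=sre, excluding the noisy
# "Check systemd state" alert: the same view as the UI filter above.
import requests

ALERTMANAGER = "https://alertmanager.example.org"  # placeholder host

def sre_alerts():
    r = requests.get(f"{ALERTMANAGER}/api/v2/alerts", params={
        "active": "true",       # roughly what @state=active selects
        "silenced": "false",
        "inhibited": "false",
        # "filter" can be repeated, one label matcher per entry
        "filter": ['team="sre"', 'alertname!="Check systemd state"'],
    }, timeout=10)
    r.raise_for_status()
    return r.json()

for alert in sre_alerts():
    print(alert["labels"].get("alertname"), alert["labels"].get("instance"))
```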
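Similarly, "silenced it for 2 days linking to the task" maps to a silence object that can be created through the same API (amtool offers an equivalent CLI); the matcher, host, and author below are placeholders:

```python
# Create a 2-day silence pointing at the decommission task, matching
# the affected job label. All concrete values here are placeholders.
import datetime
import requests

ALERTMANAGER = "https://alertmanager.example.org"  # placeholder host

def silence(matcher_name, matcher_value, task_url, days=2, author="jynus"):
    now = datetime.datetime.now(datetime.timezone.utc)
    body = {
        "matchers": [{"name": matcher_name, "value": matcher_value,
                      "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(days=days)).isoformat(),
        "createdBy": author,
        "comment": f"host decommission in progress, see {task_url}",
    }
    r = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=body, timeout=10)
    r.raise_for_status()
    return r.json()["silenceID"]

print(silence("job", "mysql-labs", "https://phabricator.wikimedia.org/T326584"))
```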
[13:15:38] think about whether there is something you both may need me for before the end of the week; I have many things WIP re:backups, but they shouldn't affect you (database backups and bacula should work as usual)
[13:16:38] will do!
[14:06:07] Any objections to this change? https://gerrit.wikimedia.org/r/c/operations/puppet/+/892964/
[14:11:38] interesting, that syntax is wrong
[14:13:02] oh
[14:13:11] Yeah I was running the compiler
[14:13:58] It's only 3, apparently: *-*-*
[14:14:14] yeah
[14:14:19] I was just sending the patch
[14:17:52] https://puppet-compiler.wmflabs.org/output/892964/39867/
[14:17:54] This looks a lot better
[14:17:56] So merging!
[15:36:02] marostegui: FWIW, every time I make a change to systemd timers, I test it with sudo systemd-analyze calendar "Mon,Tue *-*-01..04 12:00:00"
[15:36:22] h/t https://wiki.archlinux.org/title/Systemd/Timers
[15:36:46] ah nice
[15:37:45] Normalized form: *-*-* 05:00:00
[15:37:45] Next elapse: Wed 2023-03-01 05:00:00 UTC
[15:37:46] Looks good!
[16:48:17] It's gotten late and I don't want to cause alerts while people are busy, so I will find another time for the recovery test
[16:50:35] (but before the end of the week)
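The systemd-analyze tip above lends itself to a pre-merge check: validate an OnCalendar expression before it ships in a puppet timer patch. A small illustrative wrapper around the same command, using the expressions from the conversation:

```python
# Validate systemd calendar expressions the same way as the manual
# "systemd-analyze calendar ..." check above; a bad expression makes
# systemd-analyze exit non-zero, which check=True turns into an error.
import subprocess

def check_calendar(expr):
    result = subprocess.run(
        ["systemd-analyze", "calendar", expr],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)  # shows "Normalized form" and "Next elapse"

check_calendar("*-*-* 05:00:00")               # the corrected timer syntax
check_calendar("Mon,Tue *-*-01..04 12:00:00")  # the Arch wiki example
```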