[00:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:42] I've switched x1 codfw and now switching es7 codfw
[07:07:40] dhinus: m5 is now running 10.11 on the master too
[07:30:27] marostegui: ack, thanks!
[07:42:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:19] es jobs run correctly, hosts can go to maintenance any time now
[07:51:24] *ran
[08:20:24] I'm going to remove the temporary grants I added now
[08:23:16] marostegui: db1154 has been firing for replica lag https://alerts.wikimedia.org/?q=alertname%3DMariaDB%20sustained%20replica%20lag%20on%20x3&q=team%3Dsre&q=%40receiver%3Dirc-spam
[08:23:43] federico3: can you check what is going on?
[08:23:50] ok, looking
[08:24:12] thanks, let me know what you find
[08:26:06] strange, the dashboard is missing a lot of data https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13363&from=now-30d&to=now&timezone=utc&var-job=%24__all&refresh=1m
[08:26:58] looks like a glitch in the alert, if I go from the mysql dashboard and search for the host the data shows up
[08:28:13] I think it is missing from zarcillo
[08:28:51] federico3: For replication lag troubleshooting I suggest your first action is to go to orchestrator to find out what the topology is doing, whether it is all the hosts or just one, the role, etc.
[08:30:01] https://phabricator.wikimedia.org/P78593
[08:30:38] federico3: btw the host lagging is db1155, not db1154
[08:31:18] in any case, db1154 is not fully set up
[08:31:44] marostegui: I'm looking at the alert from db1154 on alertmanager
[08:31:50] jynus: what?
[08:31:52] "MariaDB sustained replica lag on x3"
[08:32:11] marostegui: it is not on zarcillo, so it won't have any prometheus metrics
[08:32:38] I see it on prometheus
[08:32:42] Ah, the section x3 for db1154
[08:32:58] federico3: was it lagging in orchestrator?
[08:33:07] no, I'm a bit puzzled
[08:33:14] is it on s4?
[08:33:37] well, it is not in s4 anymore; anyway, the reason why db1155 in s4 was lagging was a schema change
[08:33:38] after setup, it should be checked that prometheus can read it well so we get metrics
[08:34:12] I am talking about db1154:x3
[08:34:54] I have inserted it
[08:35:30] db-mysql db1154 is hanging - does it need a flag to specify which instance to connect to?
[08:36:05] ah yes, the port
[08:36:22] personally, I use "mysql.py -h db1154:x3" as it is easier to remember
[08:36:42] thanks
[08:54:52] oh TIL /etc/wmfmariadbpy/section_ports.csv
[09:09:10] we can talk about that in more depth at another time or you can ask am*r, but there was a migration that didn't complete and we ended up with 2 separate wmf db libraries
[09:09:47] this affects me because dbbackups depend on some of that functionality
[09:35:39] Amir1: can db2155 be repooled?
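The exchange above (08:35-08:54) boils down to: on a multi-instance host such as db1154, each section's mysqld listens on its own port, and /etc/wmfmariadbpy/section_ports.csv maps section names to those ports (which is what "mysql.py -h db1154:x3" relies on). Below is a minimal sketch of that lookup, not the real wmfmariadbpy code: it assumes a plain "section,port" CSV layout, and the resolve() name plus the 3306 fallback are illustrative only.

#!/usr/bin/env python3
"""Resolve a "host:section" spec such as db1154:x3 to a (host, port) pair."""
import csv

# Assumed location and layout (one "section,port" row per section); check the real file.
SECTION_PORTS = "/etc/wmfmariadbpy/section_ports.csv"


def resolve(target):
    """Turn 'db1154:x3' into ('db1154', <x3 port>); bare hostnames default to 3306."""
    if ":" not in target:
        return target, 3306
    host, section = target.split(":", 1)
    with open(SECTION_PORTS, newline="") as fh:
        for row in csv.reader(fh):
            if row and row[0].strip() == section:
                return host, int(row[1])
    raise KeyError("no port known for section %r" % section)


if __name__ == "__main__":
    print(resolve("db1154:x3"))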
[09:36:01] wait a second, I don't remember why it got depooled
[09:36:14] the commons check? one sec
[09:36:35] ah yeah, sorry I forgot
[09:37:39] Amir1: you take care of repooling it then?
[09:37:45] si
[09:38:01] thank you
[09:42:11] what's the next step for db1154?
[09:42:57] It recovered, right?
[09:43:40] yes
[09:43:58] then there are no next steps, I added it to zarcillo as I mentioned above
[09:48:12] repooling
[09:48:16] sorry I forgot
[09:49:07] marostegui: I'm a bit puzzled at how we monitor multi-instance hosts though: for example I see mysql_version_info{instance="db2151:9104", job="mysql-core", role="master", shard="s6", version="10.11.11-MariaDB-log", version_comment="MariaDB Server"} on grafana - note the shard
[09:50:41] what is wrong with the shard?
[09:52:45] I mean I'm only seeing s6 and not x3 - is it because the exporter is unaware of x3?
[09:53:15] I am not following, db2151 is only on s6, it is not on x3
[09:53:36] (It is not a master btw, so that is wrong)
[09:54:30] oops, sorry, pasted the wrong line :D :D
[09:55:36] Amir1: I think you confused db1255 with db2155
[09:55:55] Amir1: In fact db1255 is the master for x3
[09:56:29] ah sorry again
[09:56:32] I need coffee
[09:56:58] I remember I explicitly did codfw because depooling would be less impactful, but stupid brain
[09:57:00] sorry
[10:00:21] clearly we need a check digit in server numbers to prevent that confusion
[10:02:33] the captcha in the cookbook should ask for the sha256 of the host name. cc _joe_
[12:04:25] marostegui: I added the api block in dbctl for https://phabricator.wikimedia.org/T385141#10937373 - yet do we have a way to check if the weights are the right ones and so on?
[12:04:54] Amir1: does this look good to you? https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/jobs/541314
[12:05:22] federico3: there is an ongoing incident
[12:05:43] you may want to wait a bit until it finishes and Amir later takes a break
[12:06:19] ah, ok
[12:07:44] federico3: As I mentioned before, please let's track things on tasks, it is easier than IRC, which is pretty volatile and not very async
[12:19:11] ok
[12:22:39] x1 master is having some issues, I am checking
[12:23:05] I will upgrade the package with the patched one just in case
[13:32:00] sobanski: you joining the meeting?
[13:54:16] federico3: let me check tasks re: prometheus
[13:55:09] this is the primary task, driven by the obs team: https://phabricator.wikimedia.org/T321808
[13:55:45] and then there are individual ones for each category
[13:55:57] T367282 T367283 T367284
[13:56:19] some could be done through: T350360
[13:56:31] T369045
[13:56:34] I see one for mariadb https://phabricator.wikimedia.org/T367284
[13:56:47] nope, there are more
[13:56:57] ah yes
[13:56:58] I think they created one per alert
[13:57:10] and they may not even be exhaustive
[13:57:45] I would start with something easy
[13:58:03] e.g. where the data is already there and just the logic has to be created
[13:58:20] and should be set in puppet
[13:59:07] This one by amir is rather complete: https://phabricator.wikimedia.org/T315866
[13:59:41] maybe it could be used as the epic for DBAs, migrating the others under it
[14:00:25] For example, "memory pressure" could be easy, but it is used by ganeti too, needs some coordination
[14:00:50] are the obs team ok with migrating the tasks under our epic?
[14:00:54] ^ would this suggestion work? also ask the dbas if they have other priorities
[14:01:09] federico3: no idea, I'd say, ask :-D
[14:01:28] godog is usually the person that manages that
[14:01:50] but that is coordination, the task can be done no matter how it is organized on phabricator
[14:02:11] just be aware that some alerts, both mariadb and beyond, are used beyond our team and will require coordination
[14:02:27] e.g. lag on core is very important, not so important on backup sources
[14:02:50] others are more straightforward: "alert if memory is close to full"
[14:02:59] yes we spoke about this recently, I'll look around a bit more. I think for a lot of stuff it makes sense to set thresholds that are specific to our use case. The logic can be shared but the threshold might be pretty specific
[14:03:09] yes
[14:03:16] so that's the difficult part, even for me
[14:03:22] managing our needs vs other needs
[14:03:37] do you have this as a goal in your tasks?
[14:03:43] because this is a very large project
[14:04:36] I will talk to federico3 in private about the tasks and priorities
[14:04:39] ok
[14:05:35] ah, at the moment I was looking just at what needs to be done while putting together some dashboards for practical needs
[14:06:32] yeah, manuel's approach looks fine to me, discuss priorities with him, as it is a very large project
[14:06:57] that will have to be done, but given it is a migration, not a new thing, it is currently ok-ish
[14:07:31] my point is to understand the basic needs so that when putting together some dashboards/alarms that are needed for something else I can do it without reinventing the wheel :)
[14:30:06] is there anyone around that could sanity-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162917 & https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162918
[14:32:28] sure, two ticks
[14:35:26] thanks!
[14:36:09] +1*2
[14:36:46] ^surprisingly correct python code
[14:37:29] I would have gone for: '+1'*2 instead
[14:57:48] [g+1 for g in gerrits]
[14:58:25] jynus: doh, you nerd sniped me!
[15:26:24] can I start a kernel update on s5 in eqiad? it should affect db1154
[15:26:35] marostegui, Amir1 ^^^^
[15:27:00] You aware what db1154 is?
[15:27:07] If you are, then fine
[15:27:21] the multi-instance host we were looking at this morning?
[15:29:32] ah you probably mean it's the source for redacted/clouddb so the run will skip it
[15:30:03] federico3: I don't remember how the script works, but I think it skips hosts if they have slaves under them
[15:30:14] it does
[15:30:32] Then go ahead for s5
[15:30:51] same for db1213
[15:31:08] that is the m3 master
[15:32:28] so I don't think there's anything left for the kernel upgrades that we can do without flipping masters
[15:32:52] all the external store hosts that are read only can be done
[15:33:04] (the masters can be switched with just a dbctl command, and I can help with that)
[15:33:10] but please ensure all the replicas are done
[15:33:26] you've been doing es* upgrades, do you want me to upgrade them?
[15:35:38] yes please, go for them
[15:35:44] You can do the pending ones that are not masters
[15:35:53] And I can walk you through masters once those are done
[15:36:19] Double check because I am not sure the script works for them
[15:38:27] I don't think it does, IIRC, however what would the process be?
[15:39:43] ensure the host is not a source for others; depool it; update mariadb and OS?; reboot; check and pool it back in?
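That proposed loop is confirmed just below; the one guard worth automating either way is the "skips hosts if they have slaves under them" check from 15:30. A minimal sketch of such a guard follows, not the real script's code: pymysql, the ~/.my.cnf credentials file and the has_replicas() name are all assumptions.

"""Guard sketch: skip a host if anything currently replicates from it."""
import os

import pymysql  # assumption: pymysql is used here for illustration; the real tooling may differ


def has_replicas(host, port=3306):
    """True if SHOW SLAVE HOSTS reports at least one registered replica."""
    conn = pymysql.connect(
        host=host,
        port=port,
        read_default_file=os.path.expanduser("~/.my.cnf"),  # assumed credentials source
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE HOSTS")
            return cur.fetchone() is not None
    finally:
        conn.close()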
[15:40:06] the process is the same as for a normal replica, yes
[15:40:15] There are two problems
[15:40:48] You have no way to differentiate what is a master: RO sections have standalone hosts, so none of them will have any output on a show slave status; or show slave hosts;
[15:42:29] So the only way to know if it is a master is by checking if they are "masters" on MW, so you'd need to either parse the json files or parse the output of: dbctl -s eqiad section es4 get | grep master
[15:42:41] To avoid that host
[15:45:49] Hey hey folks! I've been pointed over here from #wikimedia-serviceops. I'm working on deprecating LQT on ptwikibooks which, among other things, is necessary for getting temp accounts working.
[15:45:56] uhm, looking at https://zarcillo.wikimedia.org/ I see some es* masters have weight set to 0, some 100
[15:46:29] I've got a script that's from 2015, with a few improvements, that will move approximately 13,000 LQT threads to Flow. It has no dry-run mode.
[15:46:53] aha, it maps to which sections are RO vs RW
[15:47:52] marostegui: stupid question: if the "masters" in a RO section are not receiving writes nor doing replication... what makes them masters and why can't we just depool them without a flip?
[15:48:10] federico3: Because MW needs masters there
[15:48:20] The switch is simply a dbctl command, but it is still needed
[15:48:22] We've managed to reassure ourselves that it behaves well in a development environment, but the real-world environment has been the victim of several historic attempts to do this, some complete, some partially complete, some reverted. Unfortunately, LQT isn't included in the database dumps and it hardly makes sense to build a custom export for it now, on the eve of archiving and deprecating the whole lot
[15:48:59] Amir1: ^ do you have any input on that?
[15:49:53] So I'm wondering if there's a way to mirror the existing wiki (or just its database, if I can run my own instance somewhere that can see it) for poking-around purposes?
[15:50:07] It's hardly a crisis if not, but, an ounce of prevention is worth a pound of cure and all that
[16:14:16] zip: have you tried it in beta cluster?
[16:14:23] or testwikis
[16:17:55] Do they mirror live data?
[16:18:10] I had been under the impression that they did not
[16:19:25] they are their own wikis, you can install LQT, make edits and run the scripts to make sure they work
[16:23:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:23] uf, that's returning
[16:26:35] checking and hopefully nuking it forever
[16:28:47] running puppet to make sure it doesn't recreate the systemd file
[16:28:58] *unit
[16:33:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-bacula-exporter.service on backup1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:36:31] hopefully that's gone for good
[16:39:34] I'm not sure that gives me any benefit over running against my dev wiki, unfortunately
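Back to the master-detection problem from 15:40-15:42: standalone hosts in read-only es sections return nothing for SHOW SLAVE STATUS / SHOW SLAVE HOSTS, so MW's dbctl configuration is the only reliable signal. A rough sketch of the "parse the output of dbctl ... get" idea: it assumes the command prints JSON with a per-section "master" field (verify against real output first), and section_master(), safe_to_reboot() and the es1020 hostname are made up for illustration.

"""Sketch of the dbctl-based master check mentioned at 15:42."""
import json
import subprocess


def section_master(dc, section):
    """Return the host MW/dbctl considers master for a section (assumed JSON shape)."""
    out = subprocess.run(
        ["dbctl", "-s", dc, "section", section, "get"],
        check=True, capture_output=True, text=True,
    ).stdout
    data = json.loads(out)
    # Assumption: output looks roughly like {"es4": {"master": "esNNNN", ...}, ...}
    return data[section]["master"]


def safe_to_reboot(host, dc, section):
    """A read-only es host may be depooled/rebooted only if it is not the section master."""
    return host != section_master(dc, section)


if __name__ == "__main__":
    print(safe_to_reboot("es1020", "eqiad", "es4"))  # es1020 is a made-up example host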