[00:58:34] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:31] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 19.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 24 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:57] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:13:17] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[04:38:23] Emperor: yeah, I don't know why they are alerting even. I have disabled those systemd units
[04:58:34] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter@s7.service Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:26] I have silenced them for 10 days
[05:04:31] While we check what is going on
[05:04:36] godog: ^ any idea?
[06:18:29] jynus: what are your plans with db1150? will it go back to s5 and s4? or should I just delete it from orchestrator (it will show up automatically anyways whenever you've set it back up)
[06:40:04] So icinga reports ok for db1101 but we still had those alerts
[07:33:46] marostegui: I'll finish prepping for the switch maint and then look into those
[07:33:59] thanks godog
[07:35:11] sure np
[08:40:22] Re:db1150 the plan is to provision it with some sections, but not yet decided which
[08:40:45] jynus: ok, I will remove the existing ones in orchestrator and the new ones will show up automatically
[08:44:35] Can I have a quick sanity check on https://gerrit.wikimedia.org/r/c/operations/puppet/+/903602 ?
[08:48:08] thank you Emperor!
[08:53:53] (SystemdUnitFailed) firing: (3) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:12] jynus: ^ can I start mariadb on that host (that's the backup source) so at least we don't get the noise?
[08:57:39] sure, sorry, I got net-split so I didn't saw the noise
[08:57:54] no worries, starting it now
[08:58:55] I have also restarted the ipmi exporter
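A minimal sketch of the manual workflow behind the restarts above, assuming the SystemdUnitFailed alert is driven by node_exporter's systemd collector on :9100 (the port in the alert labels) and that you have root on the host; the unit name is taken from the alert itself, and the exact alert expression is not shown in this log.

    # inspect why the unit is in the failed state, then clear it
    sudo systemctl status prometheus-ipmi-exporter.service
    sudo journalctl -u prometheus-ipmi-exporter.service -n 50
    sudo systemctl reset-failed prometheus-ipmi-exporter.service
    sudo systemctl start prometheus-ipmi-exporter.service
    # what node_exporter keeps reporting while a unit is failed (what the alert presumably sees)
    curl -s http://localhost:9100/metrics | grep node_systemd_unit_state | grep 'state="failed"' | grep ' 1$'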
[09:00:52] as I moved the service, setting up db1150 is not a priority right now for mw
[09:00:57] *me
[09:03:39] (SystemdUnitFailed) resolved: (3) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:34] (SystemdUnitFailed) firing: (3) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:54] I give up
[09:17:31] :spam:
[09:20:38] oh
[09:21:46] "Error parsing config file: unknown fields in modules: collector_cmd, custom_args"
[09:21:52] that looks like a bug to me
[09:22:17] yeah, but the thing is that I disabled that unit
[09:24:03] I would downtime the host and report it, I think that unit is new
[09:25:09] yeah definitely a problem with ipmi-exporter there
[09:25:33] marostegui: to recap, you have silenced db1101 but still shows up?
[09:25:43] shows up as in, alerts show up
[09:26:18] godog: not for now, but I assume it will once the silence is finished?
[09:27:15] I did a reset-failed
[09:27:20] I did many times XD
[09:27:26] but I guess it will be reenabled on next puppet run
[09:28:41] marostegui: you are correct yes, if the alert keeps firing and you want the silence to renew (as long as the alert fires) then you can turn the silence into an ack, by starting the silence comment with "ACK! "
[09:29:13] marostegui: there's a longer explanation here https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements
[09:30:32] Thanks godog
[09:30:55] the root problem is...why does it keep firing?
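The silence-to-ack trick godog describes can also be done from the command line; a hedged sketch using amtool, the stock Alertmanager CLI (whether amtool is what is actually used against alerts.wikimedia.org, and the exact label matchers, are assumptions; the "ACK! " comment prefix is what makes the silence renew itself per the wikitech page linked above, and 240h matches the "10 days" mentioned earlier):

    amtool silence add \
      --alertmanager.url=https://alerts.wikimedia.org \
      --author=marostegui \
      --duration=240h \
      --comment='ACK! stale wmf_auto_restart units on db1101, cleanup pending' \
      alertname=SystemdUnitFailed instance=db1101:9100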
[09:33:34] (SystemdUnitFailed) resolved: prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:49] indeed, to be clear which unit are we talking about in this case? ipmi-exporter or auto-restart mysqld exporter?
[09:34:57] there are two hosts showing this, db1150 (prometheus-ipmi-exporter.service) and db1101 (wmf_auto_restart_prometheus-mysqld-exporter@s7.service and wmf_auto_restart_prometheus-mysqld-exporter@s8.service)
[09:36:30] was the other one that had hw issues? maybe it didn't got well deployed because that
[09:36:50] es2029?
[09:37:06] that one isn't showing up, only db1150 and db1101 (db1101 didn't have HW issues, it was moved from one section to another)
[09:38:46] hah the section move explains for db1101, where was it moved from/to marostegui ?
[09:43:34] (SystemdUnitFailed) firing: (2) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:47] re: ipmi-exporter, there was an outdated version on db1150, I guess the fleetwide rollout/upgrade didn't catch it (I've upgraded it now)
[09:45:12] so that should recover soon
[09:46:41] godog: it was moved in puppet and zarcillo from s7/s8 to m1
[09:48:02] ack, yeah it looks like puppet doesn't clean up the old wmf_auto_restart units for s7/s8 and then fail when the service isn't running
[09:48:34] (SystemdUnitFailed) firing: (2) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:48:38] the immediate fix is of course to remove the units, though puppet should ideally clean those up
[09:48:40] that's a nice race conditio, cause even if you disable it....
[09:48:44] yeah
[09:57:01] (SystemdUnitFailed) firing: (2) prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:27] wait, I thought db1150 was fixed after upgrade!
[09:59:23] I can also take the time to upgrade it to 10.6 now that it is not in active use
[10:02:51] db1150 which upgrade? db1150 is the one that crashed due to the DIMM
[10:03:03] mariadb upgrade
[10:03:11] test 10.6 on a passive source
[10:03:21] if you see it useful
[10:03:40] no I mean you meant it was fixed after upgrade....which upgrade?
[10:03:44] *passive backup source
[10:03:55] I thought the issue was ipmi-exporter
[10:03:59] package
[10:04:08] ah no idea
[10:04:08] fwiw wmf-auto-restart-ipmi-exporter was also failing, I started it back up
[10:04:08] but it is still alerting
[10:04:11] and +1 go 10.6
[10:04:26] which should fix the alert I think
[10:04:35] godog fixed it then
[10:04:42] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on db1150:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:24] ^LUL. this amount overhead- mysql, the monitoring of mysql, and the monitoring of the monitoring of mysql seems too complex :-D
[10:05:47] anyways db1101 will be decommissioned next week (or reimaged)
[10:06:09] ack, I'll remove the auto-restart units for s7 and s8 for now
[10:07:28] thanks
[10:07:44] {{done}}
[10:08:02] we need some kind of cleanup process that wipes all of that
[10:08:43] +1, do sections moves happen on a regular basis?
[10:08:55] it is not super- common, but it happens
[10:09:26] I see, thank you
[10:09:32] yeah not very usual. and I normally reimage but with db1101 I was in a hurry
[10:09:35] reimaging the whole thing just to change data (e.g. to add a section) is not worth it
[10:09:56] e.g. for db1150 I had s4 but I changed s3 into s5
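Since leftover units after a section move came up a few times here, a rough sketch of what removing the s7/s8 auto-restart units by hand might look like; the unit names come from the alerts above, but whether puppet materialises matching timer/service files on disk is an assumption, so this only stops, disables and clears failed state rather than deleting anything:

    for sec in s7 s8; do
      sudo systemctl stop "wmf_auto_restart_prometheus-mysqld-exporter@${sec}.timer" || true
      sudo systemctl disable "wmf_auto_restart_prometheus-mysqld-exporter@${sec}.timer" || true
      sudo systemctl reset-failed "wmf_auto_restart_prometheus-mysqld-exporter@${sec}.service" || true
    done
    sudo systemctl daemon-reload

Anything puppet still manages will come back on the next agent run, which is the race condition mentioned above; the durable fix is for puppet to absent the old units when a host changes section.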
[10:43:11] random petty grievance: the automatic refresh of orch is quite annoying, I pinned the tab and while I make it stop every day it manages to unstop it
[10:45:56] I think it is configurable, but I really like it
[10:51:57] 😭
[11:13:02] Amir1: for the db stuff+switch you will not need me, right? (unless something weird happens)
[11:13:26] si, I want to know when I can do the m1 switchover back later
[11:13:35] like, I will be around also to have a look at backups hosts, but you will do the downtimes and stuff
[11:13:47] whne yup
[11:14:05] that one but also when does work for you for the second m1 switchover?
[11:14:19] these week backups get a bit overloaded, but surely I can fine a gap
[11:14:42] *find
[11:15:01] let me know when
[11:15:31] maybe friday or monday, but let me see after the switchover for a more concrete date- in case some jobs get delayer
[11:15:34] *delayed
[11:17:24] sure
[11:17:59] if you open to improvise, I can ping you as soon as things look safe
[11:18:01] *are
[11:19:44] sure, this week I'm on clinic duty so things are a bit hectic
[11:21:31] (SystemdUnitFailed) firing: swift_rclone_sync.service Failed on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:26] undestandable
[11:32:31] the alerts UI is a bit confusing - if I tick the obvious ticky that just silences the alert for a short period
[11:35:08] * Emperor RingTFM
[15:02:26] I was busy with other stuff, do you need any help with dbs or swift or something re:network maintenance?
[15:02:31] I will recheck backups myself now
[15:02:51] swift was fine
[15:26:31] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s7.timer Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:49] godog: ^ our good old friend
[16:47:32] marostegui: ssiiiiigh ok I'll make a note to check back tomorrow on what's going on
[19:26:31] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s7.timer Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:31] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.timer Failed on db1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:46] are you a "s7 mysqld exporter failed"-person or "excessive lag on m1"-person?
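For the "excessive lag on m1" half of that question, a minimal manual check on a replica such as db1117 or db2160, assuming shell access and a local mysql client with root credentials (on multi-instance hosts the client may need to be pointed at the m1 socket or port, which is not shown in this log); Seconds_Behind_Master is presumed to be the signal behind the sustained-lag alerts at the top of this log, based on their wording rather than the alert definition:

    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'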