[03:24:48] PROBLEM - MariaDB sustained replica lag on s8 on db2167 is CRITICAL: 10.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[03:25:48] RECOVERY - MariaDB sustained replica lag on s8 on db2167 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[05:42:40] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 104.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[05:47:40] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[07:34:46] jynus: I've acked db2139 broken replication alarm as it is going to be decommissioned
[08:36:07] Error 'Index for table 'page_props' is corrupt, I see
[09:01:32] Is there any test production server available?
[09:01:40] *test db
[09:03:08] Can I use db2202?
[09:04:27] I will upgrade it after using it
[09:05:45] go!
[09:12:06] taking db2202, will battle test s4 recoveries and then decomm db2139
[09:37:51] I am switching s5 eqiad master now
[09:40:28] All good
[11:14:00] @volans I have a few questions regarding the automation we want to implement for DBA stuff: 1) I see logs from cookbooks in a local file but I cannot find their data on logstash or traces on jaeger. Where do the logs go?
[11:15:11] federico3: sorry meeting, that's pending T213902
[11:15:11] T213902: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902
[11:16:32] I have back to back meetings until lunch, we can chat about that after lunch if that's ok for you
[11:16:43] volans: thanks. I can leave a few other questions here for when you have time later on - they are not urgent.
[11:16:55] great
[11:17:58] volans: 2) Are cookbooks only meant to be run interactively or can they run as daemons or cronjobs?
[11:19:00] AFAIAA they're only ever run interactively.
[11:19:36] Amir1: switchmaster down?
[11:19:50] Ah no, never mind
[11:19:54] it was a temporary 500
[11:20:00] ah okay
[11:20:27] it sometimes 500s when there are two candidate masters in puppet or none or other issues like that
[11:21:05] No, in this case it wasn't like that, it didn't even let me generate a task anyway
[11:21:08] It works now
[13:15:26] federico3: here we are. So for (1) I think I've answered, but lmk what you were looking for. For (2) it depends on the cookbooks, some are interactive some not, but in general most assume interactivity. There are very few interactive bits in Spicerack's libraries. So a cookbook that is not interactive could be run via a systemd timer if needed, but it would be the first, so lmk your use
[13:15:32] case to be able to give better feedback.
[13:17:07] in general there is a project to be able to schedule/run opt-in cookbooks via some internal API (maybe also a UI?). I can't promise timewise but it has been discussed in the past and resurrected recently for T384837. So it could be possible that it will be tackled in the next quarter.
[13:17:07] T384837: Integration between alertmanager and cumin cookbooks - https://phabricator.wikimedia.org/T384837
[13:17:10] volans: the use case is still https://phabricator.wikimedia.org/T384810 :) I'm putting together a basic design doc and it's linked in the summary
[13:17:35] sure :) I meant the specific bit of your use case :D
[13:17:39] I'll have a look
[16:25:45] federico3: it took 15 minutes to catch up: https://grafana.wikimedia.org/goto/CDJUU3ONg?orgId=1
[16:26:24] although the host could have been up in read only mode once it started
[16:26:59] I've noticed but I'm a bit puzzled by the i/o patterns
[16:27:55] it's a log graph
[16:28:10] I'm aware, I'm referring to https://grafana-rw.wikimedia.org/d/000000273/mysql?forceLogin=true&from=now-1h&orgId=1&refresh=1m&to=now&var-job=All&var-port=9104&var-server=db2202&viewPanel=20
[16:28:29] aha, there's an error in the grafana query
[16:29:02] what's the error?
[16:30:12] they are averaging at 5m but showing granularity of 1m (and using the old visualization template)
[16:32:03] speaking of grafana, I've noticed the charts are often set to interpolate values instead of showing the raw value, which can be visually misleading, shall we set them to "step" mode?
[16:33:35] So for the mysql dashboards: ask the DBAs (or whichever team owns the particular graph), for global config the observability team would be the people to ask
[16:34:21] e.g. if you saw that there is a global mistake, we could file a task for the obs team so it gets changed globally, etc.
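The "averaging at 5m but showing granularity of 1m" remark above can be illustrated with a small sketch. It is not the dashboard's actual query: the trailing mean below is only a stand-in for a 5-minute averaging window, and the sample readings are made up for illustration.

```python
from collections import deque


def trailing_mean(samples, window):
    """Mean over the last `window` samples at each point (shorter at the start)."""
    buf = deque(maxlen=window)
    out = []
    for value in samples:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical one-per-minute I/O readings with a single one-minute burst.
raw_1m = [10, 10, 10, 200, 10, 10, 10, 10, 10, 10]

# Plotted at 1m granularity, but each point averages the last 5 minutes.
avg_5m = trailing_mean(raw_1m, window=5)

for minute, (raw, avg) in enumerate(zip(raw_1m, avg_5m)):
    print(f"t+{minute}m raw={raw:>4} avg5m={avg:6.1f}")
```

In this sketch the one-minute burst is flattened and smeared across five plotted points, which is one way a graph can show I/O patterns that look puzzling compared with what the host actually did.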
[16:35:48] I believe there are worse issues TBH
[16:36:09] there is no pt-table-heartbeat lag on prometheus
[16:36:14] for mysql hosts
[16:36:20] so the lag is not reliable
[16:37:50] https://phabricator.wikimedia.org/T141968
[16:38:34] D-:
[16:38:43] (on grafana)
[16:38:54] it is on icinga, but that is soon to be deprecated
[16:39:51] here are other monitoring gaps: https://phabricator.wikimedia.org/T143896
[16:40:12] one of the big blockers is that there is no private monitoring storage system
[16:41:04] sorry can you please elaborate on no private monitoring storage system?
[16:41:20] everything on prometheus is publicly available
[16:41:39] so things like table sizes, we could very easily send them there, but cannot at this time
[16:42:07] as it could potentially leak user activity
[16:43:39] I've asked in the past to have a private instance for sensitive metrics or any alternative workflow
[16:48:36] jynus: thanks for the info!
[17:01:09] federico3: I will bring down db2202 to go back to s1 there, upgrade it and rebuild tables there
[18:00:56] I have left db2202 rebuilding tables
[18:01:04] it is a test host so no issue
[18:01:07] have a nice day
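For readers unfamiliar with the heartbeat-based lag discussed above: the general idea is that a timestamp row is written regularly on the primary, and lag on a replica is the replica's current time minus the newest timestamp it has applied. The following is a minimal sketch of that arithmetic only. The function names are hypothetical, and the thresholds are taken from the "(C)10 ge (W)5" values in the alerts at the top of this log, not from any actual check configuration.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_lag_seconds(last_heartbeat, now):
    """Seconds between the newest replicated heartbeat timestamp and `now`."""
    return max(0.0, (now - last_heartbeat).total_seconds())


def alert_state(lag, warning=5.0, critical=10.0):
    """Classify lag with 'greater or equal' thresholds, as in '10.8 ge 10'."""
    if lag >= critical:
        return "CRITICAL"
    if lag >= warning:
        return "WARNING"
    return "OK"


# Hypothetical example: the replica last applied a heartbeat 10.8 seconds ago.
now = datetime(2025, 1, 1, 3, 24, 48, tzinfo=timezone.utc)
last_heartbeat = now - timedelta(seconds=10.8)

lag = heartbeat_lag_seconds(last_heartbeat, now)
print(f"lag={lag:.1f}s state={alert_state(lag)}")
```

Because the measurement depends only on timestamps that travelled through replication, it keeps working when replication is stopped or broken, which is a common reason to prefer it over the server's own reported lag.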