[08:25:53] I want to test 10.6 on db2098, to prepare for an eventual migration
[08:33:48] oh
[08:33:52] want me to install it there?
[08:34:02] see review
[08:34:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/836701
[08:40:33] jynus: Probably you want to do apt-get remove wmf-mariadb104 and then run puppet
[08:41:12] any other warning or problem I could find (I was thinking of reimaging the full host)?
[08:41:22] Ah, that'd be great
[08:41:30] I do: mysql_upgrade --force
[08:41:33] Just in case
[08:41:44] I was planning to load it logically
[08:41:52] ah cool
[08:41:55] (this was about testing backups :-P)
[08:42:12] is the latest package on the debian repo?
[08:42:46] (the one that supposedly fixes the issue)
[08:43:20] yes
[08:43:23] 10.6.10
[08:43:27] I pushed it to the repo earlier today
[08:43:29] thanks
[08:44:02] this is mostly to make sure I can recover and backup from it, as I may hit some workflow you hadn't (maybe)
[08:44:12] e.g. tooling compatibility
[08:44:35] yeah, no, it makes sense to start testing there
[08:44:40] I want to have some 10.6 OKRs for next Q
[08:45:33] and as I was going to rebuild db2098 anyway, it will take me no time but could solve pain later
[09:59:04] I will be adding to this paste things that I find out along the way to later either become actionable or documentation: https://phabricator.wikimedia.org/P35151
[09:59:30] oh excellent
[09:59:39] Let me create a task with those
[09:59:54] that way I don't keep pinging you all the time and you decide if they are important or not
[10:00:50] https://phabricator.wikimedia.org/T318914
[10:24:13] weird, but this wasn't me! Check systemd state on db1183 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service
[10:32:20] let me see
[10:33:25] uh, what is that?
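[editor's note] The upgrade flow discussed above (remove the 10.4 package, let puppet install the new one, then force the upgrade step) can be sketched roughly as follows. This is a sketch under assumptions: the 10.6 package name `wmf-mariadb106` is a guess by analogy with `wmf-mariadb104`, and `run-puppet-agent` is the local puppet wrapper; adjust to the actual puppetization.

```shell
# Remove the old 10.4 package; the data directory is left in place.
sudo apt-get remove wmf-mariadb104

# Let puppet install the 10.6 package (10.6.10 pushed to the Debian repo).
# Assumption: puppet is already configured to provision wmf-mariadb106.
sudo run-puppet-agent

# Run the upgrade step even if the server thinks it is unnecessary,
# so the system tables are definitely rebuilt for 10.6 ("just in case").
sudo mysql_upgrade --force
```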
[10:33:47] https://bugzilla.redhat.com/show_bug.cgi?id=1023820
[10:46:36] I have started it
[11:05:09] I'm re-importing db2098:s7 from dumps, it may take a while due to reduced available memory + compression: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2098&var-port=13317
[11:05:28] after that, I will load db2098, but I may have to wait until next week
[11:05:37] *db2098:s8
[11:51:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:35:14] not sure why that is failing if metrics are actually showing up
[12:39:37] interesting, I think I found something different, but not sure if it is 10.6 or my installation process
[12:41:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:56:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:25:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2098:13318) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:30:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:40:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:10:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2098:13318) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:15:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:18:13] jynus: failing means systemd thinks the unit is failed. How that matches up to reality is left as an exercise for the reader :)
[15:21:46] I think there is some incompatibility with mariadb > 10.5
[15:25:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:26:15] and that may be new compared to a regular upgrade
[15:26:26] we've not tested 10.6 on multi-instance hosts so maybe there's something
[15:27:08] prometheus needs BINLOG MONITOR, PROCESS, SLAVE MONITOR
[15:27:51] and I think if I just apply the current grants, the last one is omitted
[15:28:21] I added this to P35151
[15:28:27] but more work may be needed
[15:29:47] what we have not tested either is a full logical recovery + 10.6
[15:29:49] most metrics will not be affected, but I think the non-fatal errors are detected by prometheus
[15:29:59] yes, that is why I am testing it :-D
[15:31:14] the graceful failure is nice for not losing metrics, but harder to debug
[15:32:02] (I didn't think of checking the prometheus scraper metrics, since metrics were going through)
[15:33:36] the other thing that was confusing me is "access denied" metrics
[15:34:08] access denied tracks both connection denied and operation denied - connections were going through but only some statements were failing
[15:34:51] now there are 0 access denied errors: https://grafana.wikimedia.org/goto/jgEpEI4Vz?orgId=1
[15:36:04] (mariadb logs the first but not the second in the error log :-/)
[15:36:29] (at least with current verbosity)
[15:37:19] Emperor: sorry for the noise, but better to hit an issue now and finally understand what was going on than during a real emergency recovery
[15:46:28] the other part of my confusion is thinking that Debian #953040 was fixed
[15:47:25] (which technically it is, but not in the way I complained about); they suggest using the password "nopassword" instead of "This is a fake passsword, but cannot be empty due to Debian #953040"
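[editor's note] The missing grant discussed above can be applied as sketched below. On MariaDB 10.5+ the old REPLICATION CLIENT privilege was renamed BINLOG MONITOR, and a separate SLAVE MONITOR privilege is now required for SHOW SLAVE STATUS, which is why grants copied from a 10.4 host leave the exporter partially broken. The `'prometheus'@'localhost'` account name and the multi-instance socket path are assumptions; adjust to the local exporter user.

```shell
# Sketch: grant what prometheus-mysqld-exporter needs on MariaDB >= 10.5.
# Assumptions: the exporter connects as 'prometheus'@'localhost', and this
# is the s8 instance socket on a multi-instance host.
sudo mysql --socket /run/mysqld/mysqld.s8.sock <<'SQL'
GRANT BINLOG MONITOR, PROCESS, SLAVE MONITOR ON *.* TO 'prometheus'@'localhost';
-- Verify all three privileges are now present:
SHOW GRANTS FOR 'prometheus'@'localhost';
SQL
```

This also matches the "access denied" symptom above: with incomplete grants the exporter still connects, so only the per-statement denials increment MariaDB's Access_denied_errors status counter, and nothing is written to the error log at default verbosity.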