[08:25:53] I want to test 10.6 on db2098, to prepare for an eventual migration
[08:33:48] oh
[08:33:52] want me to install it there?
[08:34:02] see review
[08:34:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/836701
[08:40:33] jynus: Probably you want to do apt-get remove wmf-mariadb104 and then run puppet
[08:41:12] any other warning or problem I could find (I was thinking of reimaging the full host)?
[08:41:22] Ah, that'd be great
[08:41:30] I do: mysql_upgrade --force
[08:41:33] Just in case
[08:41:44] I was planning to load it logically
[08:41:52] ah cool
[08:41:55] (this was about testing backups :-P)
[08:42:12] is the latest package on the debian repo?
[08:42:46] (the one that supposedly fixes the issue)
[08:43:20] yes
[08:43:23] 10.6.10
[08:43:27] I pushed it to the repo earlier today
[08:43:29] thanks
[08:44:02] this is mostly to make sure I can recover and backup from it, as I may hit some workflow you hadn't (maybe)
[08:44:12] e.g. tooling compatibility
[08:44:35] yeah, no, it makes sense to start testing there
[08:44:40] I want to have some 10.6 OKRs for next Q
[08:45:33] and as I was going to rebuild db2098 anyway, it will take me no time but could solve pain later
[09:59:04] I will be adding to this paste things that I find out along the way to later either become actionable or documentation: https://phabricator.wikimedia.org/P35151
[09:59:30] oh excellent
[09:59:39] Let me create a task with those
[09:59:54] that way I don't keep pinging you all the time and you decide if they are important or not
[10:00:50] https://phabricator.wikimedia.org/T318914
[10:24:13] weird, but this wasn't me! Check systemd state on db1183 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service
[10:32:20] let me see
[10:33:25] uh, what is that?
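[editor's note] The upgrade flow discussed above (remove the 10.4 package, let puppet install the new one, then force the upgrade step) can be sketched roughly as follows. This is a sketch under assumptions: the 10.6 package name `wmf-mariadb106` is a guess by analogy with `wmf-mariadb104`, and `run-puppet-agent` is the local puppet wrapper; adjust to the actual puppetization.

```shell
# Remove the old 10.4 package; the data directory is left in place.
sudo apt-get remove wmf-mariadb104

# Let puppet install the 10.6 package (10.6.10 pushed to the Debian repo).
# Assumption: puppet is already configured to provision wmf-mariadb106.
sudo run-puppet-agent

# Run the upgrade step even if the server thinks it is unnecessary,
# so the system tables are definitely rebuilt for 10.6 ("just in case").
sudo mysql_upgrade --force
```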
[10:33:47] https://bugzilla.redhat.com/show_bug.cgi?id=1023820
[10:46:36] I have started it
[11:05:09] I'm re-importing db2098:s7 from dumps, it may take a while due to reduced available memory + compression: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2098&var-port=13317
[11:05:28] after that, I will load db2098, but I may have to wait until next week
[11:05:37] *db2098:s8
[11:51:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:35:14] not sure why that is failing if metrics are actually showing up
[12:39:37] interesting, I think I found something different, but not sure if it is 10.6 or my installation process
[12:41:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:56:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:25:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2098:13318) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:30:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:40:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:10:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db2098:13318) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:15:16] (PrometheusMysqldExporterFailed) firing: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:18:13] jynus: failing means systemd thinks the unit is failed. How that matches up to reality is left as an exercise for the reader :)
[15:21:46] I think there is some incompatibility with mariadb > 10.5
[15:25:16] (PrometheusMysqldExporterFailed) resolved: (2) Prometheus-mysqld-exporter failed (db2098:13317) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[15:26:15] and that may be new compared to a regular upgrade
[15:26:26] we've not tested 10.6 on multi-instance hosts so maybe there's something
[15:27:08] prometheus needs BINLOG MONITOR, PROCESS, SLAVE MONITOR
[15:27:51] and I think if I just apply the current grants, the last one is omitted
[15:28:21] I added this to P35151
[15:28:27] but more work may be needed
[15:29:47] what we have not tested either is a full logical recovery + 10.6
[15:29:49] most metrics will not be affected, but I think the non-fatal errors are detected by prometheus
[15:29:59] yes, that is why I am testing it :-D
[15:31:14] the graceful failure is nice for not losing metrics, but harder to debug
[15:32:02] (I didn't think of checking the prometheus scraper metrics, since metrics were going through)
[15:33:36] the other thing that was confusing me is "access denied" metrics
[15:34:08] access denied tracks both connection denied and operation denied - connections were going through but only some statements were failing
[15:34:51] now there are 0 access denied errors: https://grafana.wikimedia.org/goto/jgEpEI4Vz?orgId=1
[15:36:04] (mariadb logs the first but not the second in the error log :-/)
[15:36:29] (at least with current verbosity)
[15:37:19] Emperor: sorry for the noise, but better to hit an issue now and finally understand what was going on than during a real emergency recovery
[15:46:28] the other part of my confusion is thinking that Debian #953040 was fixed
[15:47:25] (which technically it is, but not in the way I complained about); they suggest using the password "nopassword" instead of "This is a fake passsword, but cannot be empty due to Debian #953040"
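[editor's note] The missing grant discussed above can be applied as sketched below. On MariaDB 10.5+ the old REPLICATION CLIENT privilege was renamed BINLOG MONITOR, and a separate SLAVE MONITOR privilege is now required for SHOW SLAVE STATUS, which is why grants copied from a 10.4 host leave the exporter partially broken. The `'prometheus'@'localhost'` account name and the multi-instance socket path are assumptions; adjust to the local exporter user.

```shell
# Sketch: grant what prometheus-mysqld-exporter needs on MariaDB >= 10.5.
# Assumptions: the exporter connects as 'prometheus'@'localhost', and this
# is the s8 instance socket on a multi-instance host.
sudo mysql --socket /run/mysqld/mysqld.s8.sock <<'SQL'
GRANT BINLOG MONITOR, PROCESS, SLAVE MONITOR ON *.* TO 'prometheus'@'localhost';
-- Verify all three privileges are now present:
SHOW GRANTS FOR 'prometheus'@'localhost';
SQL
```

This also matches the "access denied" symptom above: with incomplete grants the exporter still connects, so only the per-statement denials increment MariaDB's Access_denied_errors status counter, and nothing is written to the error log at default verbosity.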