[08:34:10] !incidents
[08:34:10] You're not allowed to perform this action.
[08:55:26] !incidents
[08:55:26] 6255 (RESOLVED) es2038 (paged)/MariaDB Replica Lag: es7 (paged)
[08:55:26] 6256 (RESOLVED) es2040 (paged)/MariaDB Replica Lag: es7 (paged)
[08:57:21] federico3: https://wikitech.wikimedia.org/wiki/Vopsbot#Installation_and_configuration describes where the ACL for sirenbot comes from
[10:10:38] Can someone from o11y help me debug FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable ?
[10:10:52] I think I know what is going on, but I need some help with Prometheus to confirm
[10:11:08] marostegui: I'll take a look
[10:11:31] tappof: Thanks, I think it comes from the fact that db2186 and db2187 are no longer part of labs, but I am unable to confirm
[10:11:33] What is your doubt?
[10:13:30] marostegui: https://prometheus-codfw.wikimedia.org/ops/targets?search=&scrapePool=mysql-labs
[10:14:35] Exactly!
[10:15:11] I think that confirms your hypothesis
[10:15:34] tappof: Thanks, it does. We are going to reimage this host today, so it should clear up
[10:15:40] I will ping you if it doesn't, if that's ok
[10:15:47] sure marostegui
[10:16:33] tappof: grazie mille [thanks a million]
[10:16:38] elukey: ^
[10:16:40] marostegui: prego :) [you're welcome]
[10:18:16] marostegui: without gestures it doesn't count!
[10:18:24] hahah
[10:18:29] ahah
[12:08:31] tappof: So federico3 finished reinstalling the host. I thought https://prometheus-codfw.wikimedia.org/ops/targets?search=&scrapePool=mysql-labs would clear (as those ports are no longer on that host), but now I am wondering how this is built.
I know it is using the zarcillo DB, but that looks good
[12:10:57] marostegui: the reimage script just triggered a Puppet run on the Icinga server, deleted the silence, and is starting Puppet on it for the first run
[12:11:12] Maybe we need a puppet run on the prometheus hosts then
[12:11:14] let me see
[12:11:45] and it says "Starting first Puppet run (sit back, relax, and enjoy the wait)" :D
[12:11:56] ah sorry, I thought it had finished the reimage!
[12:12:00] Then it is not done yet
[12:12:12] maybe it does it after this step?
[12:12:40] We'll see when it finishes then
[12:13:45] https://phabricator.wikimedia.org/P76722 fyi
[12:22:03] This is showing db2187, but I don't get how, if that host is no longer in zarcillo: /srv/prometheus/ops/targets/mysql-labsdb_codfw.yaml
[12:22:34] marostegui: federico3 I think the next Puppet run on the Prometheus instances should fix the job's targets, but I'll double-check
[12:22:59] I'm running it on prom2005 now.
[12:23:15] I have fixed a small thing on zarcillo, but it shouldn't be the cause
[12:23:25] it's been in waiting_for_optimal for a bit
[12:23:50] federico3: Did you start mariadb?
[12:24:12] no, it's still running the cookbook
[12:24:28] yes, but it will time out
[12:24:35] because it will never be optimal if you don't start mariadb
[12:24:36] should I start it while the cookbook is still running?
[12:24:45] ok
[12:25:36] tappof: It didn't change anything on the puppet run
[12:25:43] Per the logs
[12:26:47] mariadb is running
[12:26:55] federico3: start slave too
[12:27:13] show tableS;
[12:27:16] gah, not here
[12:28:36] is the master configured by some automation?
[12:28:40] no
[12:28:42] what?
[12:29:19] /opt/wmf-mariadb106/scripts/mysql_install_db --basedir=/opt/wmf-mariadb106/ should I run this before?
[12:29:35] No, the data has not been touched
[12:29:41] You've reimaged the host without formatting /srv
[12:29:46] just start mariadb and then start slave;
[12:30:00] ok, started
[12:31:58] yep, replag dropping
[12:33:51] I think I found the issue
[12:36:28] > tappof: It didn't change anything on the puppet run
[12:36:32] Yeah, I noticed
[12:40:13] tappof: I am running out of ideas, but I am seeing that /usr/local/sbin/mysqld_exporter_config.py isn't generating mysql-labs.yaml anymore (because it now has nothing for it), so I am wondering if there's some cleanup needed on prometheus?
[12:40:50] marostegui: Let me try something
[12:40:50] mysql-labs.yaml for eqiad does have hosts, but the codfw one doesn't and never will again
[12:52:09] marostegui: I just ran /usr/local/sbin/mysqld_exporter_config.py -D, and mysql-labs is not listed in the output, so I think it has no way to compute the diff with the old config file and remove it ...
[12:52:48] tappof: Yeah, that's sort of what I mentioned, like there's some cleanup we have to do there
[12:53:49] marostegui: Yeah, I think it's enough to delete the mysql-labsdb_codfw.yaml file from the ops/targets directory
[12:54:56] Let's see
[12:55:46] Ok, that looks clean now, running puppet to make sure it is not going to be generated again
[13:00:24] tappof: puppet didn't create it, so all good!
[13:00:26] thanks for the help
[13:02:20] marostegui: You're welcome.. I'm going to remove the file from prom2006 as well
[13:02:29] thanks
[21:03:42] nothing to report during on-call
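[Editor's note] For context on the files discussed above (e.g. /srv/prometheus/ops/targets/mysql-labsdb_codfw.yaml): Prometheus file-based service discovery (`file_sd_configs`) reads YAML or JSON files containing a list of target groups, each with a `targets` array and optional `labels`. The exact contents of the Wikimedia file are not shown in the log; the snippet below is a hypothetical sketch of that format, using db2186 with the conventional mysqld_exporter port 9104 as an illustrative target.

```yaml
# Hypothetical file_sd target file, e.g. mysql-labsdb_codfw.yaml.
# Prometheus re-reads these files on change; an entry here keeps the
# target scraped even after the host has been repurposed, which is why
# a stale file has to be deleted for the JobUnavailable alert to clear.
- targets:
    - db2186:9104
  labels:
    cluster: mysql-labs
```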
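[Editor's note] The root cause tappof identifies at 12:52:09 is a general failure mode of config generators: a script that writes one target file per job, driven by a job list from a database, emits nothing at all for a job that has disappeared from its input, so the job's old file is never removed. The sketch below is not the actual mysqld_exporter_config.py code (which is not shown in the log); it is a minimal illustration of the bug class and of the cleanup step that fixes it, with hypothetical function names.

```python
import os
import tempfile

def write_target_files(outdir, jobs):
    """Write one file_sd-style target file per job.

    Mirrors the failure mode from the log: a job absent from `jobs`
    produces no output, so its previously generated file is left behind.
    """
    for job, targets in jobs.items():
        path = os.path.join(outdir, f"{job}.yaml")
        with open(path, "w") as f:
            f.write("- targets:\n")
            for target in targets:
                f.write(f"  - {target}\n")

def prune_stale_files(outdir, jobs):
    """The missing cleanup step: delete files for jobs no longer generated."""
    expected = {f"{job}.yaml" for job in jobs}
    removed = []
    for name in os.listdir(outdir):
        if name.endswith(".yaml") and name not in expected:
            os.remove(os.path.join(outdir, name))
            removed.append(name)
    return removed
```

With this shape, regenerating after a job vanishes leaves its stale file on disk until `prune_stale_files` (or a manual `rm`, as done on prom2005/prom2006 above) removes it.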