[10:22:45] Amir1: Any reason why https://phabricator.wikimedia.org/T291584 cannot be done earlier than Oct?
[10:22:48] ie: next week?
[11:06:30] "All s1 core codfw hosts upgraded to 10.4.21" you mean the replicas to the minor version?
[11:08:20] yes, sorry
[11:08:23] let me edit that comment
[11:08:30] no issue, I understood now
[11:08:40] I thought you were asking to merge the backup migration patch
[11:08:47] No no :)
[11:08:54] Haven't done the master switch yet or the candidate master
[11:09:00] I am doing small things here and there during the clinic duty week
[11:09:01] but you mean to upgrade the 10.4 source
[11:09:04] yep
[11:09:08] I can take care of that
[11:09:12] thanks :)
[11:09:20] as I will have to upgrade s6 too
[11:09:53] yeah, I didn't want to go ahead and do it without you involved just in case
[11:20:41] jynus: if you want to do the s8 source replica too, that's also fine
[11:21:04] yeah, I will do all soon, but not today
[11:21:13] sure, no prob!
[11:21:19] unless you have plans to upgrade s8 soon?
[11:21:22] no no
[11:21:26] I am upgrading the replicas
[11:21:49] I will do s8 next, I had to do maintenance on them anyway
[11:22:06] but it is a lot of servers and I really need to finish something else first
[11:22:36] no worries at all
[11:27:02] so if the question is really, can you take care of upgrading the source backups- the answer is: of course!
[12:53:39] marostegui: the sooner the better
[12:53:44] I don't mind when
[12:54:18] ah ok, as you mentioned earlier Oct I thought it was something specific
[13:20:04] marostegui: Thank you so much <3
[13:20:28] <3
[14:01:09] I'm looking at alerting for the "prometheus-mysqld-exporter has failed" issue T257056 ; We don't have a data-persistence/dba-specific alert route configured (cf https://github.com/wikimedia/puppet/blob/production/modules/alertmanager/templates/alertmanager.yml.erb ), so shall I put it under sre as a whole, or do we want a more specific route?
[14:01:09] T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056
[14:02:52] Thanks for looking into that
[14:03:04] I think it is too DBA-specific to put it under SRE as a whole
[14:03:32] So maybe we should create a data-persistence one, so we can also add any other future one there
[14:03:35] That was my thinking too.
[14:04:43] (also, bit sad that we seem to have (at least?) 3 different alert-configuration approaches in use)
[14:05:02] 3?
[14:05:42] Grafana alerts, alertmanager alerts configured via the puppet repo, alertmanager alerts configured via the operations/alerts repo
[14:06:21] I am not familiar with operations/alerts, is it a new repo open to non-SREs?
[14:06:30] Stevie Beth referred me to modules/profile/manifests/mariadb/replication_lag.pp in puppet, but https://wikitech.wikimedia.org/wiki/Alertmanager leads me to think operations/alerts is where (at least new) alerts should be configured
[14:07:05] so filipo is the canonical person to ask really, but I think the idea is we are in a bit of a transition
[14:07:19] from icinga-based to alertmanager
[14:07:44] in theory SREs are not yet migrating to alertmanager
[14:07:59] but some time-based alerts or metrics are only available on grafana
[14:12:09] prometheus has mysql_exporter_last_scrape_error so I think it is at least plausible that an alert could be set up on it.
[14:12:40] in a way a very similar alert to that already exists
[14:12:49] in prometheus :-)
[14:13:15] Oh? T257056 suggests we are lacking one...
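To make the 14:12:09 point concrete, here is a minimal Python sketch of a check on mysql_exporter_last_scrape_error. It assumes the exporter serves the Prometheus text exposition format and that the gauge is non-zero when the last scrape of MySQL failed. The helper name scrape_failed and the SAMPLE payload are hypothetical illustrations, not taken from the puppet or operations/alerts repos.

```python
def scrape_failed(exposition_text, metric="mysql_exporter_last_scrape_error"):
    """Return True if `metric` is present with a non-zero value.

    Assumption: a non-zero value means the exporter's last scrape failed.
    """
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled sample: name{label="value"} <value> [timestamp]
            name_part, _, rest = line.rpartition("}")
            name = name_part.split("{", 1)[0]
        else:
            # Plain sample: name <value> [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name == metric and fields:
            return float(fields[0]) != 0.0
    return False  # metric absent: nothing to report


# Hypothetical payload, for illustration only.
SAMPLE = """\
# TYPE mysql_exporter_last_scrape_error gauge
mysql_exporter_last_scrape_error 1
"""

if __name__ == "__main__":
    print(scrape_failed(SAMPLE))
```

In an alertmanager setup the same condition would more likely be written as a rule expression on the metric than as a script; the sketch only shows the comparison an alert would make.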
[14:13:15] T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056
[14:13:46] no, we do lack one, I was contradicting only your last statement :-D
[14:13:54] let me find it
[14:15:12] (which is relevant to a possible implementation, not just to make you angry :-)
[14:15:53] This monitors and alerts when the number of scraping failures is over the threshold: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=Prometheus+jobs+reduced+availability
[14:16:31] so one could be set up, theoretically, based on prometheus + a lower threshold
[14:17:01] but most likely a less distributed, more classical option is preferred, mentioning it because it is probably quite relevant
[14:19:21] I don't know really
[14:22:16] :)
[14:30:47] Damn tendril, you run a big query and then it dies entirely...
[14:30:58] Going to try to fix it
[14:37:32] It is amazing how as soon as you mess with its buffer pool it goes crazy
[14:53:41] ok, looks fixed now
[16:01:45] CI that just says "failed" without any hint as to what failed makes me sad
[16:02:03] Emperor: which one?
[16:02:08] https://gerrit.wikimedia.org/r/c/operations/alerts/+/723223
[16:02:30] volans: which tells me https://integration.wikimedia.org/ci/job/trigger-alerts-pipeline-test/109/console
[16:02:32] oh wow, that's new
[16:02:36] which just says "failed"
[16:02:48] Emperor, looks to me like a glitch
[16:02:57] say recheck as a comment to retry
[16:02:58] I was hoping it was the puppet repo, which is not very good at showing you the failures, but they are there
[16:02:59] I mean, I've never tried making prometheus alerts & tests before, so I probably did do _something_ wrong, but :)
[16:03:16] ah, it is a different repo
[16:03:29] still try recheck first, see if it repeats
[16:04:37] doing so
[16:07:42] well, if I install enough stuff myself, I can run the tests myself (and get a barfogram), but IWBNI the CI exposed that for me
[16:15:48] recheck has same problem
[16:16:15] so the repo must be set up incorrectly, or there is some issue with CI checks
[16:16:44] sorry I cannot be of more help, I haven't used that repo before (my guess is it must be relatively recent)
[16:18:00] https://integration.wikimedia.org/ci/job/alerts-pipeline-test/109/console
[16:19:05] majavah: OK, so where should I have found that?
[16:19:35] Ah.
[16:20:05] from the last log you pasted, see the "18:56:14 alerts-pipeline-test #109 started." line?
[16:20:17] click on that build number and then console output
[16:20:45] Yes, found that now. Thank you!
[16:20:45] I think it's some CI technical restriction that it needs a wrapper build like that
[16:22:51] it is, of course, whitespace in YAML that is causing me woe
[16:26:35] tox run locally is now happy; let's see if the CI is...
[16:30:19] success
[23:50:14] I have a gerrit patch that I think would fix the grants for T271480 up at https://gerrit.wikimedia.org/r/c/operations/puppet/+/723329.
[23:50:14] T271480: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480
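On the 16:22:51 YAML whitespace remark: the log does not say which whitespace problem broke the alerts test, so the following is only a guess at the kind of pre-commit check that can catch such issues locally. It flags two common culprits, tabs in indentation (which YAML does not allow) and trailing whitespace. The function name whitespace_problems is hypothetical and the check is not part of the operations/alerts tooling.

```python
from typing import List, Tuple


def whitespace_problems(yaml_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, description) pairs for suspicious whitespace."""
    problems = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace of the line, i.e. its indentation.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((lineno, "tab in indentation"))
        if line != line.rstrip():
            problems.append((lineno, "trailing whitespace"))
    return problems


if __name__ == "__main__":
    # Hypothetical snippet: line 2 is indented with a tab, line 3 ends in a space.
    snippet = "a:\n\tb: 1\nc: 2 \n"
    for lineno, problem in whitespace_problems(snippet):
        print(f"line {lineno}: {problem}")
```

A dedicated linter would cover far more cases; this only shows how little is needed to surface the problem before CI does.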