[05:01:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:58] Amir1: Let me know if I can create pc8 in dbctl
[05:19:11] I believe it is being ignored at the moment by MW, but just to double check
[07:32:01] Are we expecting to not have the data-persistence team on alerts.wikimedia.org any more?
[07:39:35] ?
[07:41:16] !
[07:41:47] I've seen it is gone for dbs, do you have an example for non-db alerts?
[07:52:12] Well, if I visit alerts.wikimedia.org and start typing team=dat into the search bar, I get offered data-engineering and data-platform. And even if I finish typing out team=data-persistence, the UI no longer lets me look at all our team's alerts
[07:52:45] [Maybe this is a consequence of us having no alerts, even silenced ones, right now, but that would be unexpected]
[08:14:59] I will keep an eye out for it in case it is only that
[08:24:38] dbprov1005 backups failed due to not having the latest mariadb package - fixing
[08:25:57] (not really failed, but aborted by logic)
[08:26:44] less failing, more deferring success? :)
[08:27:17] can it be considered a failure if it did what it exactly told it to do?
[08:27:23] *I
[08:27:37] (and what was expected)
[08:36:47] Let me add an alert to icinga https://i.imgflip.com/9utir0.jpg
[09:01:04] Emperor: you were not wrong, DP alerts are no longer marked in the right group: https://alerts.wikimedia.org/?q=alertname%3Dsnapshot%20of%20x3%20in%20codfw&q=team%3Dsre&q=%40receiver%3Dirc-spam
[09:01:38] :sadpanda:
[09:04:49] do icinga alerts get a team? aren't the teams just for the ones in the alerts repository?
[09:05:55] not by default, but you can configure that: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/prometheus/icinga_exporter.yaml
[09:08:57] Emperor: wanna start a patch there, at least for swift and ceph?
[09:16:08] not right now, have a couple of pressing things, but it does seem worth doing
[09:17:20] ofc
[09:37:52] volans: we /used/ to see team-specific alerts (e.g. for disk-nearly-full)
[09:39:29] marostegui: yeah, as long as the weight is 0 it should be fine. My question actually is, why not just pool it if the table and grants are there?
[09:40:17] Sure, I can do that too :)
[09:40:25] I just wanted to make sure it is all fine from MW side
[09:40:59] it should be, I need to add the purge piece but that can wait until the end of the day (it runs once a day)
[09:41:37] in the long term, I want to remove that altogether and leave mediawiki to issue a job for clean up, but that's a bit longer
[09:45:37] Amir1: ok, then I am going to pool them
[09:46:40] 🍿
[09:49:55] Amir1: Pushing
[09:50:27] wohoo
[09:50:49] I see traffic
[09:52:24] I don't see errors, and application layer latency has gone up a bit, which is expected
[09:52:38] only thing is that it's not showing up in https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all&from=now-1h&to=now&timezone=utc
[09:52:56] Let me see
[09:54:57] The hosts started to draw just now, so maybe it needs a bit?
[09:55:05] From zarcillo point of view they look good
[09:55:28] Ah no
[09:55:30] I found the issue
[09:55:31] Fixing
[09:56:15] (unrelated, but sort of related: remember x3 hosts are not on zarcillo, only the backup sources)
[09:56:47] Yeah, they are part of s8 still
[09:56:52] Until we do the split
[09:57:14] yeah, no worries. Just that I noticed it when adding the backups
[09:57:18] Amir1: I fixed the zarcillo part for parsercache, I guess it will take a bit for the exporter to pick it up
[09:57:29] that's already great, thank you!
[09:59:18] Amir1: It is showing here https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-1h&to=now&timezone=utc&var-site=eqiad&var-group=$__all&var-shard=pc8&var-role=$__all but waiting for the exporter to pick up the move of those hosts from core to parsercache
[09:59:56] \o/
[10:00:16] I'm working on adding the purger
[10:03:22] Amir1: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-5m&to=now&timezone=utc&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all
[10:03:26] It is now showing there
[10:03:49] 🎉
[10:03:50] Thanks!
[10:44:29] o/ I've been putting off migrating db_lag_stats_reporter to k8s but I'm going to do it now if that's okay https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143533
[10:44:38] what should I be keeping an eye on to monitor it?
[11:07:28] hnowlan: reading the code, it goes to loadbalancer_lag_milliseconds and loadbalancer_lag_seconds in prometheus I think
[11:08:16] but IIRC, gauge stats don't work in prometheus T394956
[11:08:17] T394956: Restore support for gauge metrics from MediaWiki PHP (post-Prometheus migration) - https://phabricator.wikimedia.org/T394956
[11:08:23] so it's probably already broken
[11:10:54] ah :( migrating it shouldn't do much harm in that case, at least
[15:20:24] marostegui: to check which datacenter is active, can I look at the CNAME s1-master.eqiad.wmnet or elsewhere?
[15:30:40] federico3: confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get
[15:30:51] thanks
[15:31:08] (optionally | jq -r '.WMFMasterDatacenter.val' if you want just the name in a script)
[15:52:48] FIRING: PuppetFailure: Puppet has failed on thanos-be1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:57:48] FIRING: [3x] PuppetFailure: Puppet has failed on thanos-be1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:12:48] FIRING: [4x] PuppetFailure: Puppet has failed on thanos-be1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:33:48] ^-- new nodes still being installed by dc-ops; I'm guessing the install isn't behaving itself yet
[16:52:12] [I've silenced them for 24h anyhow]
[20:38:17] PROBLEM - MariaDB sustained replica lag on s5 on db1159 is CRITICAL: 37 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
[20:39:17] RECOVERY - MariaDB sustained replica lag on s5 on db1159 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
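For reference, the active-datacenter check discussed at [15:30:40]-[15:31:08] can be combined into a single snippet. This is a minimal sketch only: the confctl and jq invocations are the ones quoted in the log, while the surrounding wrapper script (its name, error handling, and output message) is an illustrative assumption, not an existing tool.

  #!/bin/bash
  # Sketch: print the currently active MediaWiki (primary) datacenter.
  # confctl/jq calls are quoted from the exchange above; the wrapper is illustrative.
  set -euo pipefail

  # Query conftool for the WMFMasterDatacenter object and extract just its value.
  active_dc="$(confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get \
      | jq -r '.WMFMasterDatacenter.val')"

  echo "Active MediaWiki datacenter: ${active_dc}"

Reading the datacenter from conftool rather than from a CNAME such as s1-master.eqiad.wmnet avoids depending on per-section DNS records, which answers the question asked at [15:20:24].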