[05:01:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db1155:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:58] Amir1: Let me know if I can create pc8 in dbctl
[05:19:11] I believe it is being ignored at the moment by MW, but just to double check
[07:32:01] Are we expecting to not have the data-persistence team on alerts.wikimedia.org any more?
[07:39:35] ?
[07:41:16] !
[07:41:47] I've seen it is gone for dbs, do you have an example for non-db alerts?
[07:52:12] Well, if I visit alerts.wikimedia.org and start typing team=dat into the search bar, I get offered data-engineering and data-platform. And even if I finish typing out team=data-persistence, the UI no longer lets me look at all our team's alerts
[07:52:45] [Maybe this is a consequence of us having no alerts, even silenced ones, right now, but that would be unexpected]
[08:14:59] I will keep an eye out for it in case it is only that
[08:24:38] dbprov1005 backups failed due to not having the latest mariadb package - fixing
[08:25:57] (not really failed, but aborted by logic)
[08:26:44] less failing, more deferring success? :)
[08:27:17] can it be considered a failure if it did what it exactly told it to do?
[08:27:23] *I
[08:27:37] (and what was expected)
[08:36:47] Let me add an alert to icinga https://i.imgflip.com/9utir0.jpg
[09:01:04] Emperor: you were not wrong, DP alerts are no longer marked in the right group: https://alerts.wikimedia.org/?q=alertname%3Dsnapshot%20of%20x3%20in%20codfw&q=team%3Dsre&q=%40receiver%3Dirc-spam
[09:01:38] :sadpanda:
[09:04:49] do icinga alerts get a team? aren't the teams just for the ones in the alerts repository?
[09:05:55] not by default, but you can configure that: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/prometheus/icinga_exporter.yaml
[09:08:57] Emperor: wanna start a patch there, at least for swift and ceph?
[09:16:08] not right now, have a couple of pressing things, but it does seem worth doing
[09:17:20] ofc
[09:37:52] volans: we /used/ to see team-specific alerts (e.g. for disk-nearly-full)
[09:39:29] marostegui: yeah, as long as the weight is 0 it should be fine. My question actually is, why not just pool it if the table and grants are there?
[09:40:17] Sure, I can do that too :)
[09:40:25] I just wanted to make sure it is all fine from MW side
[09:40:59] it should be, I need to add the purge piece but that can wait until the end of the day (it runs once a day)
[09:41:37] in the long term, I want to remove that altogether and leave mediawiki to issue a job for clean up, but that's a bit longer
[09:45:37] Amir1: ok, then I am going to pool them
[09:46:40] 🍿
[09:49:55] Amir1: Pushing
[09:50:27] wohoo
[09:50:49] I see traffic
[09:52:24] I don't see errors, and application layer latency has gone up a bit, which is expected
[09:52:38] only thing is that it's not showing up in https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all&from=now-1h&to=now&timezone=utc
[09:52:56] Let me see
[09:54:57] The hosts started to draw just now, so maybe it needs a bit?
[09:55:05] From zarcillo point of view they look good
[09:55:28] Ah no
[09:55:30] I found the issue
[09:55:31] Fixing
[09:56:15] (unrelated, but sort of related: remember x3 hosts are not on zarcillo, only the backup sources)
[09:56:47] Yeah, they are part of s8 still
[09:56:52] Until we do the split
[09:57:14] yeah, no worries. Just that I noticed it when adding the backups
[09:57:18] Amir1: I fixed the zarcillo part for parsercache, I guess it will take a bit for the exporter to pick it up
[09:57:29] that's already great, thank you!
[09:59:18] Amir1: It is showing here https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-1h&to=now&timezone=utc&var-site=eqiad&var-group=$__all&var-shard=pc8&var-role=$__all but waiting for the exporter to pick up the move of those hosts from core to parsercache
[09:59:56] \o/
[10:00:16] I'm working on adding the purger
[10:03:22] Amir1: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-5m&to=now&timezone=utc&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all
[10:03:26] It is now showing there
[10:03:49] 🎉
[10:03:50] Thanks!
[10:44:29] o/ I've been putting off migrating db_lag_stats_reporter to k8s but I'm going to do it now if that's okay https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143533
[10:44:38] what should I be keeping an eye on to monitor it?
[11:07:28] hnowlan: reading the code, it goes to loadbalancer_lag_milliseconds and loadbalancer_lag_seconds in prometheus I think
[11:08:16] but IIRC, gauge stats don't work in prometheus T394956
[11:08:17] T394956: Restore support for gauge metrics from MediaWiki PHP (post-Prometheus migration) - https://phabricator.wikimedia.org/T394956
[11:08:23] so it's probably already broken
[11:10:54] ah :( migrating it shouldn't do much harm in that case, at least
[15:20:24] marostegui: to check which datacenter is active, can I look at the CNAME s1-master.eqiad.wmnet or elsewhere?
[15:30:40] federico3: confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get
[15:30:51] thanks
[15:31:08] (optionally | jq -r '.WMFMasterDatacenter.val' if you want just the name in a script)
[15:52:48] FIRING: PuppetFailure: Puppet has failed on thanos-be1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:57:48] FIRING: [3x] PuppetFailure: Puppet has failed on thanos-be1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:12:48] FIRING: [4x] PuppetFailure: Puppet has failed on thanos-be1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:33:48] ^-- new nodes still being installed by dc-ops; I'm guessing the install isn't behaving itself yet
[16:52:12] [I've silenced them for 24h anyhow]
[20:38:17] PROBLEM - MariaDB sustained replica lag on s5 on db1159 is CRITICAL: 37 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
[20:39:17] RECOVERY - MariaDB sustained replica lag on s5 on db1159 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1159&var-port=9104
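For reference, the active-datacenter check discussed at [15:30:40]-[15:31:08] can be combined into a single snippet. This is a minimal sketch only: the confctl and jq invocations are the ones quoted in the log, while the surrounding wrapper script (its name, error handling, and output message) is an illustrative assumption, not an existing tool.

  #!/bin/bash
  # Sketch: print the currently active MediaWiki (primary) datacenter.
  # confctl/jq calls are quoted from the exchange above; the wrapper is illustrative.
  set -euo pipefail

  # Query conftool for the WMFMasterDatacenter object and extract just its value.
  active_dc="$(confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get \
      | jq -r '.WMFMasterDatacenter.val')"

  echo "Active MediaWiki datacenter: ${active_dc}"

Reading the datacenter from conftool rather than from a CNAME such as s1-master.eqiad.wmnet avoids depending on per-section DNS records, which answers the question asked at [15:20:24].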