[00:36:37] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[02:08:03] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:36:38] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[06:08:04] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:42:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1133:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:53:49] (SystemdUnitFailed) firing: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:24] oops, sorry Emperor Amir1 about the noise :/ I went to practice with IRC off
[08:31:55] s2 dumps on eqiad failed, checking them
[08:33:09] "Could not execute query: Cannot load from mysql.proc. The table is probably corrupted"
[08:34:08] this is db1239.eqiad.wmnet
[08:36:38] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[08:37:03] do you know if db1139 used to have 10.6?
[08:37:24] I am getting all kinds of corruption issues
[08:38:38] it doesn't seem so: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1000302/2/hieradata/hosts/db1139.yaml
[08:39:48] 10.4.28+deb11u1
[08:39:55] was the last version of mariadb installed on db1139
[08:40:39] I wonder then what could cause the issues on db1239 compared to db1139
[08:42:25] it's the same version 🤔 I've cloned those using https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976709; so it runs mysql_upgrade systematically, as discussed a few months ago
[08:42:32] which is a noticeable difference
[08:42:49] you didn't clone that one
[08:43:12] oh, you're right, it's multi-instance but a backup source
[08:44:47] (SystemdUnitFailed) resolved: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:03] checking
[08:46:05] that's not me, that is SMART, nothing to do with backups
[08:47:41] it is dead and very slow, hw disk issue?
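To answer the "hw disk issue?" question above, the long SMART self-test proposed in the next message can be started and read back with smartctl. A minimal sketch, assuming smartmontools is installed and run as root; the device path (and running it directly on dbprov1003) is an assumption for illustration:

```python
#!/usr/bin/env python3
"""Sketch: kick off a long SMART self-test and read back the results."""
import subprocess

DEVICE = "/dev/sda"  # hypothetical: replace with the slow/suspect drive


def smartctl(*args: str) -> str:
    """Run smartctl with the given arguments and return its stdout."""
    result = subprocess.run(
        ["smartctl", *args, DEVICE],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit bits even on partial success
    )
    return result.stdout


# Quick overall health verdict (PASSED/FAILED).
print(smartctl("-H"))

# Start the long (extended) offline self-test; smartctl prints the
# estimated completion time, typically hours for a large disk.
print(smartctl("-t", "long"))

# Later, read the self-test log to see whether the test completed
# without error or aborted with a read failure.
print(smartctl("-l", "selftest"))
```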
[08:48:11] let's run a long SMART test
[08:50:50] it looks ok, back to dbprov1004
[09:26:22] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[09:35:21] I've opened https://gerrit.wikimedia.org/r/c/operations/alerts/+/1002925 to have those alerts go directly to serviceops (rather than to us, who then ask them for help)
[09:41:24] ...and merged, so hopefully we won't get any more of them
[09:52:25] arnaudb: I am going to retry the backup - I think there was a non-data issue - I restarted and upgraded the host, and will see if the backup succeeds this time
[09:52:40] noted! thanks for your investigation :)
[09:58:42] arnaudb: also good news - the dump hadn't really failed, but as the heuristic checks the logs for errors and considers a backup with those not successful, it had alerts as an unsuccessful backup
[09:58:59] *alerted
[09:59:07] so backup checks working nicely
[10:00:31] how is it configured, by the way?
[10:00:44] what do you mean?
[10:04:13] to be a bit more specific: is there a notion of severity? those logs may be worth alerting over, but it does not seem obvious to figure out what was triggering the alert
[10:06:05] "is there a notion of severity" - I still don't understand the question. Do I have a notion? Do they alert? What do you mean?
[10:08:25] sorry, I'm not clear enough: in the log parsing performed by export_smart_data_dump.service (I haven't read the config), I was curious whether there is a way to alert according to different thresholds, or with a bit more metadata to make analysis easier, as this one did not seem explicit
[10:09:32] ah, I know nothing about that - that is not an alert we have set up, to my understanding
[10:09:44] I was talking about the backup issue on a different host
[10:10:07] export_smart_data_dump.service was on dbprov1003
[10:10:17] the backup issue was related to dbprov1004
[10:16:08] oh
[10:16:50] hence why I was confused :-D
[11:43:04] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1133:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:44:20] it will be decom, I'll mute it
[11:45:13] Emperor and others: if for some reason we *have* to keep the IRC alerts (because reasons), we should move those to -feed (?)
[11:45:42] it makes sense to me
[11:46:04] let's wait first for the response on the ticket
[11:46:55] yeah, I think let's see what the response is; if we have other classes of alerts we'd like treated similarly [e.g. these PuppetZeroResources?], could folk comment on the phab ticket to that effect, please?
[11:48:46] I think the issue is alertmanager tooling (not just an SRE issue) with the downtime recipe not affecting it (only icinga) - but I may be wrong
[11:48:55] *not just a data persistence issue
[11:50:01] maybe it works and it just expired, idk
[11:50:18] jynus: I think the downtime cookbook is working
[11:50:41] I downtimed db1133 yesterday evening (about 18:42, for 13 hours)
[11:50:52] but that downtime has now expired
[11:50:56] I see
[11:51:39] I didn't want to silence it for too long, as I'm not the person working on it (hi arnaudb :) ), but thought there was no value in us getting poked about it overnight
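For readers unfamiliar with the backup check mentioned at 09:58, the idea described there is that a dump whose log contains error lines is flagged as unsuccessful even if the data files were written. A hedged sketch of that kind of heuristic; the paths and patterns below are hypothetical, not the actual wmfbackups configuration:

```python
#!/usr/bin/env python3
"""Sketch of a log-scanning heuristic for dump success/failure."""
from pathlib import Path

# Hypothetical patterns; the real check may look for different strings.
ERROR_PATTERNS = ("ERROR", "Could not execute query", "corrupted")


def backup_looks_successful(log_path: Path) -> bool:
    """Return False if any line of the dump log matches an error pattern."""
    for line in log_path.read_text(errors="replace").splitlines():
        if any(pattern in line for pattern in ERROR_PATTERNS):
            return False
    return True


if __name__ == "__main__":
    log = Path("/srv/backups/dumps/latest/dump.s2.log")  # hypothetical path
    status = "successful" if backup_looks_successful(log) else "unsuccessful"
    print(f"s2 dump considered {status}")
```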
[11:52:11] wait, when you said downtime, did you use the cookbook or manually on alertmanager?
[11:52:18] cookbook
[11:52:27] ok, then it works indeed
[11:52:53] "!log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 13:00:00 on db1133.eqiad.wmnet with reason: hush" from #wikimedia-operations :)
[11:53:07] I am guessing the issue is that non-host aggregated stats are sometimes not handled
[11:53:27] e.g. "widespread puppet failures" on a cluster with only 2 hosts
[12:01:24] arnaudb: I've marked all backup hosts as ready for network maintenance
[12:02:10] also, the s2 backup succeeded after giving some hugs to the db
[12:22:19] Amir1: In Cognate we use "ConnectionManager". I would like to construct that with an external LB from virtual domain mapping, but there's no nice option for this right now.
[12:22:45] hoo: I'd say you should stop using CM altogether
[12:22:58] the idea is basically to get rid of that and replace it with ICP
[12:23:27] hoo: Also, LBs shouldn't be used directly either
[12:26:05] Sounds good
[12:27:23] yeah, basically it should simplify things a lot, just inject ICP (getConnectionProvider service) and call that
[12:29:04] We either use virtual mapping or we allow for using the default domain… not sure how to handle this.
[12:29:23] Is it fine to assume that the local DB will be used if no virtual mapping is set up?
[12:30:54] yup
[12:31:04] Great, thanks :)
[12:31:10] that's the point basically, localhost and tests will use local
[12:31:32] in prod, you point it to x1
[12:33:12] hoo: Maybe looking at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReadingLists/+/985162 helps as an example
[13:04:56] afk for a bit
[14:59:48] (PuppetFailure) firing: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:02:20] urandom: ^-- dunno if it's worth adding some downtime for the new restbase nodes?
[15:29:39] Emperor: these keep cropping up as dcops brings them online. I think they're imaging them as Puppet 7, but the insetup role is Puppet 5. I'll fix it.
[15:31:10] ...and, I think that particular one is being re-reimaged as we speak...
[15:35:24] why the discrepancy in puppet version?
[15:38:13] I don't knowo
[15:38:14] I don't know
[15:38:38] it comes up as Puppet 7, using an insetup that is still 5
[15:38:43] not sure how it gets that way
[15:41:14] based on T354893, the host in question (restbase1035) has been a... journey
[15:41:14] T354893: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893
[15:43:04] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:51:42] I am restarting codfw media backups, checking everything comes back well
[16:54:32] jynus: thanks!
[16:59:37] INFO:backup 487 new files found on this batch
[16:59:45] INFO:backup 487 files were inserted correctly
[17:00:09] followed, of course, by the usual query timeout: ERROR:backup Error returned by the API call: {'code': 'internal_api_error_DBQueryTimeoutError', 'info': '[981ee18d-d47b-45b4-8ce3-fb308ec75b0d] Caught exception of type Wikimedia\\Rdbms\\DBQueryTimeoutError', 'errorclass': 'Wikimedia\\Rdbms\\DBQueryTimeoutError'}
[17:00:47] (but I just hit production until it works)
[17:20:03] :(
[17:20:19] This year I will improve the file db storage in mw
[17:24:45] no, Amir, this year we will all do what we can, and we will be happy about that :D
[17:25:42] fair. It's in the roadmap, we might get to it, we might not
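The "hit production until it works" approach at 17:00 amounts to retrying the API call whenever it comes back with internal_api_error_DBQueryTimeoutError. A minimal sketch of such a retry loop; the endpoint, query parameters, and backoff policy are assumptions for illustration, not the actual media-backups implementation:

```python
#!/usr/bin/env python3
"""Sketch: retry a MediaWiki API query that intermittently hits a DB query timeout."""
import time

import requests

API_URL = "https://commons.wikimedia.org/w/api.php"  # assumed endpoint
PARAMS = {
    "action": "query",
    "list": "allimages",
    "ailimit": "500",
    "format": "json",
}
# Wikimedia APIs ask for a descriptive User-Agent; this one is illustrative.
HEADERS = {"User-Agent": "media-backups-retry-sketch/0.1 (example)"}


def query_with_retries(max_attempts: int = 5, base_delay: float = 10.0) -> dict:
    """Retry with linear backoff while the API reports a DB query timeout."""
    for attempt in range(1, max_attempts + 1):
        response = requests.get(API_URL, params=PARAMS, headers=HEADERS, timeout=120)
        response.raise_for_status()
        data = response.json()
        error = data.get("error", {})
        if error.get("code") != "internal_api_error_DBQueryTimeoutError":
            return data  # either success or a different error worth surfacing
        print(f"attempt {attempt}: query timed out server-side, retrying")
        time.sleep(base_delay * attempt)
    raise RuntimeError("API query kept timing out after all retries")


if __name__ == "__main__":
    result = query_with_retries()
    print(len(result.get("query", {}).get("allimages", [])), "files listed")
```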