[00:36:37] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[02:08:03] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:36:38] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[06:08:04] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:42:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1133:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:53:49] (SystemdUnitFailed) firing: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:24] oops, sorry Emperor Amir1 about the noise :/ I went to practice with IRC off
[08:31:55] s2 dumps on eqiad failed, checking them
[08:33:09] "Could not execute query: Cannot load from mysql.proc. The table is probably corrupted"
[08:34:08] this is db1239.eqiad.wmnet
[08:36:38] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[08:37:03] do you know if db1139 used to have 10.6?
[08:37:24] I am getting all kinds of corruption issues
[08:38:38] it doesn't seem so: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1000302/2/hieradata/hosts/db1139.yaml
[08:39:48] 10.4.28+deb11u1
[08:39:55] was the last version of mariadb installed on db1139
[08:40:39] I wonder then what could cause the issues on db1239 compared to db1139
[08:42:25] it's the same version 🤔 I've cloned those using https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976709; so it runs mysql_upgrade systematically, as discussed a few months ago
[08:42:32] which is a noticeable difference
[08:42:49] you didn't clone that one
[08:43:12] oh, you're right, it's multi-instance but a backup source
[08:44:47] (SystemdUnitFailed) resolved: export_smart_data_dump.service on dbprov1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:03] checking
[08:46:05] that's not me, that is SMART, nothing to do with backups
[08:47:41] it is dead and very slow, hw disk issue?
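To answer the "hw disk issue?" question above, the long SMART self-test proposed in the next message can be started and read back with smartctl. A minimal sketch, assuming smartmontools is installed and run as root; the device path (and running it directly on dbprov1003) is an assumption for illustration:

```python
#!/usr/bin/env python3
"""Sketch: kick off a long SMART self-test and read back the results."""
import subprocess

DEVICE = "/dev/sda"  # hypothetical: replace with the slow/suspect drive


def smartctl(*args: str) -> str:
    """Run smartctl with the given arguments and return its stdout."""
    result = subprocess.run(
        ["smartctl", *args, DEVICE],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit bits even on partial success
    )
    return result.stdout


# Quick overall health verdict (PASSED/FAILED).
print(smartctl("-H"))

# Start the long (extended) offline self-test; smartctl prints the
# estimated completion time, typically hours for a large disk.
print(smartctl("-t", "long"))

# Later, read the self-test log to see whether the test completed
# without error or aborted with a read failure.
print(smartctl("-l", "selftest"))
```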
[08:48:11] let's run a long SMART test
[08:50:50] it looks ok, back to dbprov1004
[09:26:22] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[09:35:21] I've opened https://gerrit.wikimedia.org/r/c/operations/alerts/+/1002925 to have those alerts go directly to serviceops (rather than to us, who then ask them for help)
[09:41:24] ...and merged, so hopefully we won't get any more of them
[09:52:25] arnaudb: I am going to retry the backup - I think there was a non-data issue - I restarted and upgraded the host, and will see if the backup succeeds this time
[09:52:40] noted! thanks for your investigation :)
[09:58:42] arnaudb: also good news - the dump hadn't really failed, but as the heuristic checks the logs for errors and considers a backup with those not successful, it had alerts as an unsuccessful backup
[09:58:59] *alerted
[09:59:07] so backup checks working nicely
[10:00:31] how is it configured, by the way?
[10:00:44] what do you mean?
[10:04:13] to be a bit more specific: is there a notion of severity? those logs may be worth alerting over, but it does not seem obvious to figure out what was triggering the alert
[10:06:05] "is there a notion of severity" - I still don't understand the question. Do I have a notion? Do they alert? What do you mean?
[10:08:25] sorry, I'm not clear enough: in the log parsing performed by export_smart_data_dump.service (I haven't read the config), I was curious whether there is a way to alert according to different thresholds, or with a bit more metadata to make analysis easier, as this one did not seem explicit
[10:09:32] ah, I know nothing about that - that is not an alert we have set up, to my understanding
[10:09:44] I was talking about the backup issue on a different host
[10:10:07] export_smart_data_dump.service was on dbprov1003
[10:10:17] the backup issue was related to dbprov1004
[10:16:08] oh
[10:16:50] hence why I was confused :-D
[11:43:04] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1133:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:44:20] it will be decom, I'll mute it
[11:45:13] Emperor and others: if for some reason we *have* to keep the IRC alerts (because reasons), we should move those to -feed (?)
[11:45:42] it makes sense to me
[11:46:04] let's wait first for the response on the ticket
[11:46:55] yeah, I think let's see what the response is; if we have other classes of alerts we'd like treated similarly [e.g. these PuppetZeroResources?], could folk comment on the phab ticket to that effect, please?
[11:48:46] I think the issue is alertmanager tooling (not just an SRE issue) with the downtime recipe not affecting it (only icinga) - but I may be wrong
[11:48:55] *not just a data persistence issue
[11:50:01] maybe it works and it just expired, idk
[11:50:18] jynus: I think the downtime cookbook is working
[11:50:41] I downtimed db1133 yesterday evening (about 18:42, for 13 hours)
[11:50:52] but that downtime has now expired
[11:50:56] I see
[11:51:39] I didn't want to silence it for too long, as I'm not the person working on it (hi arnaudb :) ), but thought there was no value in us getting poked about it overnight
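For readers unfamiliar with the backup check mentioned at 09:58, the idea described there is that a dump whose log contains error lines is flagged as unsuccessful even if the data files were written. A hedged sketch of that kind of heuristic; the paths and patterns below are hypothetical, not the actual wmfbackups configuration:

```python
#!/usr/bin/env python3
"""Sketch of a log-scanning heuristic for dump success/failure."""
from pathlib import Path

# Hypothetical patterns; the real check may look for different strings.
ERROR_PATTERNS = ("ERROR", "Could not execute query", "corrupted")


def backup_looks_successful(log_path: Path) -> bool:
    """Return False if any line of the dump log matches an error pattern."""
    for line in log_path.read_text(errors="replace").splitlines():
        if any(pattern in line for pattern in ERROR_PATTERNS):
            return False
    return True


if __name__ == "__main__":
    log = Path("/srv/backups/dumps/latest/dump.s2.log")  # hypothetical path
    status = "successful" if backup_looks_successful(log) else "unsuccessful"
    print(f"s2 dump considered {status}")
```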
[11:52:11] wait, when you said downtime, did you use the cookbook or manually on alertmanager?
[11:52:18] cookbook
[11:52:27] ok, then it works indeed
[11:52:53] "!log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 13:00:00 on db1133.eqiad.wmnet with reason: hush" from #wikimedia-operations :)
[11:53:07] I am guessing the issue is that non-host aggregated stats are sometimes not handled
[11:53:27] e.g. "widespread puppet failures" on a cluster with only 2 hosts
[12:01:24] arnaudb: I've marked all backup hosts as ready for network maintenance
[12:02:10] also, the s2 backup succeeded after giving some hugs to the db
[12:22:19] Amir1: In Cognate we use "ConnectionManager". I would like to construct that with an external LB from virtual domain mapping, but there's no nice option for this right now.
[12:22:45] hoo: I'd say you should stop using CM altogether
[12:22:58] the idea is basically to get rid of that and replace it with ICP
[12:23:27] hoo: Also, LBs shouldn't be used directly either
[12:26:05] Sounds good
[12:27:23] yeah, basically it should simplify things a lot, just inject ICP (getConnectionProvider service) and call that
[12:29:04] We either use virtual mapping or we allow for using the default domain… not sure how to handle this.
[12:29:23] Is it fine to assume that the local DB will be used if no virtual mapping is set up?
[12:30:54] yup
[12:31:04] Great, thanks :)
[12:31:10] that's the point basically, localhost and tests will use local
[12:31:32] in prod, you point it to x1
[12:33:12] hoo: Maybe looking at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReadingLists/+/985162 helps as an example
[13:04:56] afk for a bit
[14:59:48] (PuppetFailure) firing: Puppet has failed on restbase1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:02:20] urandom: ^-- dunno if it's worth adding some downtime for the new restbase nodes?
[15:29:39] Emperor: these keep cropping up as dcops brings them online. I think they're imaging them as Puppet 7, but the insetup role is Puppet 5. I'll fix it.
[15:31:10] ...and, I think that particular one is being re-reimaged as we speak...
[15:35:24] why the discrepancy in puppet version?
[15:38:13] I don't knowo
[15:38:14] I don't know
[15:38:38] it comes up as Puppet 7, using an insetup that is still 5
[15:38:43] not sure how it gets that way
[15:41:14] based on T354893, the host in question (restbase1035) has been a... journey
[15:41:14] T354893: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893
[15:43:04] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1135:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:51:42] I am restarting codfw media backups, checking everything comes back well
[16:54:32] jynus: thanks!
[16:59:37] INFO:backup 487 new files found on this batch
[16:59:45] INFO:backup 487 files were inserted correctly
[17:00:09] followed, of course, by the usual query timeout: ERROR:backup Error returned by the API call: {'code': 'internal_api_error_DBQueryTimeoutError', 'info': '[981ee18d-d47b-45b4-8ce3-fb308ec75b0d] Caught exception of type Wikimedia\\Rdbms\\DBQueryTimeoutError', 'errorclass': 'Wikimedia\\Rdbms\\DBQueryTimeoutError'}
[17:00:47] (but I just hit production until it works)
[17:20:03] :(
[17:20:19] This year I will improve the file db storage in mw
[17:24:45] no, Amir, this year we will all do what we can, and we will be happy about that :D
[17:25:42] fair. It's in the roadmap, we might get to it, we might not
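The "hit production until it works" approach at 17:00 amounts to retrying the API call whenever it comes back with internal_api_error_DBQueryTimeoutError. A minimal sketch of such a retry loop; the endpoint, query parameters, and backoff policy are assumptions for illustration, not the actual media-backups implementation:

```python
#!/usr/bin/env python3
"""Sketch: retry a MediaWiki API query that intermittently hits a DB query timeout."""
import time

import requests

API_URL = "https://commons.wikimedia.org/w/api.php"  # assumed endpoint
PARAMS = {
    "action": "query",
    "list": "allimages",
    "ailimit": "500",
    "format": "json",
}
# Wikimedia APIs ask for a descriptive User-Agent; this one is illustrative.
HEADERS = {"User-Agent": "media-backups-retry-sketch/0.1 (example)"}


def query_with_retries(max_attempts: int = 5, base_delay: float = 10.0) -> dict:
    """Retry with linear backoff while the API reports a DB query timeout."""
    for attempt in range(1, max_attempts + 1):
        response = requests.get(API_URL, params=PARAMS, headers=HEADERS, timeout=120)
        response.raise_for_status()
        data = response.json()
        error = data.get("error", {})
        if error.get("code") != "internal_api_error_DBQueryTimeoutError":
            return data  # either success or a different error worth surfacing
        print(f"attempt {attempt}: query timed out server-side, retrying")
        time.sleep(base_delay * attempt)
    raise RuntimeError("API query kept timing out after all retries")


if __name__ == "__main__":
    result = query_with_retries()
    print(len(result.get("query", {}).get("allimages", [])), "files listed")
```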