[07:05:19] majavah: https://phabricator.wikimedia.org/T297094#7572072
[09:37:46] marostegui: i see in https://gerrit.wikimedia.org/r/c/operations/software/+/747681 that you removed `--connect-timeout $DB_TIMEOUT` - was that intentional?
[09:37:54] it'll work with db-mysql, you just need to put it after the instance
[09:38:04] so `db-mysql db1163 --connect-timeout 5` works fine
[09:38:45] (you can also put it before the instance _if_ you use a =, `db-mysql --connect-timeout=5 db1163`. but it's just easier in general to put the instance first.)
[09:42:13] marostegui: for https://gerrit.wikimedia.org/r/c/operations/software/+/747682, you changed the output substantially
[09:42:22] (and also pointed it at the wrong db instance)
[09:55:42] kormat: mmmm, looks like I sent the wrong commit to that one indeed
[09:55:46] I will fix it
[09:55:54] as per the --connect-timeout, yeah, that was intentional
[09:56:41] I use a slightly modified version of section
[09:56:46] And I committed that one by mistake
[09:56:54] you monster
[09:57:24] Which is still better than the one in the repo I think! But I will go back to the "normal" one so I don't mess with people's version :)
[09:57:52] marostegui: i'm fine with the different output
[09:58:32] i don't think i really use 'section' anyway
[09:59:07] I use it all the time XD
[10:02:28] kormat: https://gerrit.wikimedia.org/r/c/operations/software/+/747799/
[10:07:20] kormat: https://phabricator.wikimedia.org/T297618#7574748 XD
[10:07:34] * kormat sighs
[10:17:26] marostegui: any objection to me re-deploying the sys schema to db1124?
[10:17:31] (to test a change to dbtools/sys/apply)
[10:17:41] kormat: go for it!
[10:19:04] victory
[11:24:51] I'm gonna merge a big change and run puppet on all db hosts. Stay tuned.
[11:25:31] * kormat hides
[11:28:58] root@cumin1001:~# cumin -b 1 -s 30 'P:mariadb::mysql_role' 'puppet agent -tv'
[11:28:59] 220 hosts will be targeted:
[11:29:07] 😭
[11:29:24] Amir1: use run-puppet-agent
[11:29:30] you should never use puppet agent manually
[11:29:35] *directly
[11:30:04] volans: thanks, where are the docs for it
[11:30:41] volans: oh the optimism in this one
[11:30:44] not sure there are any, it's mentioned in most wikitech docs for many services
[11:30:59] and your onboarding buddy should have told you ;)
[11:32:09] from my side I will refer to cumin's docs :D
[11:32:09] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_discarding_the_output
[11:32:25] okay found it, thanks
[11:32:56] and yes we should probably add a paragraph to puppet's wikitech page somewhere
[11:33:04] kormat: speaking of which, we need to mark my last onboarding checklist thingy: "have a meeting with your buddy"
[11:33:12] also Amir1, if you really want to deploy a change gradually
[11:33:29] you should disable puppet, then merge, then run run-puppet-agent with the option to re-enable puppet
[11:33:50] disabling with 'disable-puppet'
[11:34:06] it's just adding a file so it's not that important
[11:34:13] (unused file)
[11:34:56] so it's not a matter of safety of the change?
[11:35:03] do you just need it quickly deployed?
[11:35:16] otherwise I would just let puppet do its work and in 30m you'll have your file
[11:36:34] no, the puppet change won't break the system but still might fail to apply. I want to notice it
[11:36:40] in some edge cases
[11:37:37] two possible alternatives:
[11:38:02] keep an eye on grafana: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?viewPanel=3&orgId=1
[11:38:09] or keep an eye on icinga
[11:38:59] if you do run puppet though I suggest you bump the -b up to 10 or similar or it will take ages to complete ;)
[11:39:56] yeah, I need to do that
[11:40:10] what is "no resources"?
[11:40:14] for what it's worth, it ran fine on db1163
[11:40:36] agent failed is scary but is this fine?
[11:41:02] ah 3rd option keep an eye on puppetboard :D
[11:41:18] https://puppetboard.wikimedia.org/nodes?status=failed
[11:41:24] pc and db there ;)
[11:41:54] Amir1: usually means it couldn't compile at all
[11:42:06] Could not find template 'profile/mariadb/grants/production-parsercache.sql.erb'
[11:42:32] ugh, okay, let me try to fix it
[11:45:34] is this going to alert?
[11:46:26] it's only on backups and pc
[11:46:33] if it reaches some %
[11:46:34] yes
[11:46:40] I don't recall the threshold
[11:50:31] it should be small-ish, m hosts, parsercache,
[11:51:39] fix merged
[11:51:59] running puppet
[11:52:17] I am going to take a long lunch break
[11:52:25] I will have my laptop with me
[11:55:32] Amir1: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed is always your friend ;)
[11:55:54] and you can ofc reduce the target to your original selection
[11:58:06] thanks. it's now all passing
[12:02:03] :)
[12:02:14] it alerted but not on VO, that's good :D
[12:03:17] ah no, no pages for that
[12:03:21] just icinga
[12:03:50] https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?viewPanel=3&orgId=1&from=now-1h&to=now
[12:03:54] it's recovering
[12:06:42] puppetdb says the number of failing hosts is now 28 (was 38 a couple minutes ago)
[12:10:31] 19
[15:41:05] marostegui: https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-mediawiki-2021.12.16?id=BCTjw30BtgVOxr_Gw34n
[15:42:47] interesting, that query took me 3 seconds to run
[15:43:18] it can be cold cache or whatever
[15:43:26] yeah for sure
[15:43:33] it has the max time, that's what I'm happy about
[15:43:51] hehe yeah I noticed
[16:20:50] godog: ms-be2065 has a sad drive, it looks like swift-drive-audit is confused, though - Errors found but device unavailable: sdq:12 (so it fails to umount & comment out in fstab)
[16:33:40] Emperor: sigh, I'm taking a look too
[16:34:10] godog: thanks
[16:40:36] Emperor: not sure off the bat why drive-audit isn't finding the device heh, though the failure looks real
[16:40:48] refusing to umount rather, not "not finding"
[16:45:19] AFAICT "Errors found but device unavailable" means it didn't try to umount?
[16:48:57] looks like that to me too, didn't even try
[16:49:54] I think because it thinks the device from the errors was "sdq:12" rather than "sdq"?
[16:53:16] mmhh 12 looks like the error count, from the formatting string
[16:57:39] I have to go shortly, though drive-audit aside it looks to me like umounting and failing the drive on the controller will then trigger the right events (e.g. a task to dcops)
[16:59:39] I've also got to vanish very imminently. Last time I Did This Wrong, and DCops needed a lot of persuading the disk was actually bad enough to replace
[17:04:28] ok I think tomorrow's fine too
[17:49:56] marostegui: https://phabricator.wikimedia.org/T297147#7575942
[19:09:00] cool!!! good one!
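Note on the 16:49-16:53 swift-drive-audit exchange: a minimal sketch of the reading suggested there, assuming (as guessed in the discussion, not confirmed) that the `sdq:12` token in the log message is `<device>:<error count>`. `split_device_token` is a hypothetical helper for illustration and is not part of swift-drive-audit.

```python
def split_device_token(token):
    """Split a 'device:count' token into (device, error_count).

    Assumption: the part after the last ':' is an error count, as guessed
    in the chat. A token without a numeric suffix is returned unchanged
    with a count of None.
    """
    device, sep, count = token.rpartition(":")
    if sep and count.isdigit():
        return device, int(count)
    return token, None


print(split_device_token("sdq:12"))  # ('sdq', 12)
print(split_device_token("sdq"))     # ('sdq', None)
```

Under that assumption, looking up the device by the bare name `sdq` rather than the full token `sdq:12` would be the step to check first.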