[06:10:09] I have renamed the revision_actor_temp table in production on s6 (frwiki, jawiki, ruwiki, wikitech) and s8 (wikidata). I did a few hosts a week ago and all went fine, but if you notice any errors on any of those wikis, please ping me https://phabricator.wikimedia.org/T307906
[06:21:30] <_joe_> marostegui: I'd write to ops@ frankly
[06:22:16] <_joe_> of all of SRE I think 3, maybe 4 people look at mediawiki errors with any regularity
[06:23:27] _joe_: Right, I will do that
[07:21:31] Krinkle: as I understand it, it has 3 modes and a Prometheus server is not strictly needed. The 1st and 2nd modes are meant to be run in CI, checking all or just the modified rules.
[07:21:37] [cit.] """Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics."""
[10:08:32] hi all, I'm going to temporarily disable puppet fleet-wide to perform puppetmaster/db reboots
[10:10:33] ack
[10:45:14] fyi, in relation to the spdx stuff I have now added spdx:convert:role and spdx:convert:profile tasks to convert roles and profiles, similar to how the module one works
[10:45:26] thanks to vol.ans for the suggestion and reviews
[11:30:10] disabling puppet again, I missed puppetmaster1001
[12:59:54] the kafka uid/gid task is done, now we have the same config in deployment-prep and prod (and sane defaults in puppet for future clusters)
[13:00:08] it was a no-op everywhere, but ping me in case you see anything weird
[13:00:29] very nice! that was a long time coming :-)
[13:00:51] yeah, my bad, I postponed deployment-prep for too long :)
[13:00:54] ack, thanks!
[13:00:56] Thanks elukey, much appreciated.
[13:01:11] mayyybe in the future we'll be able to have pki certs as well, we'll see :)
[13:08:28] Created the incident report for today's incident: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart not sure about the actionables, probably need subject matter experts there to fill stuff in :) (cc lmata )
[13:10:36] I will do the scorecard, actionables I would say jbond?
[13:10:51] I did some of the most obvious scorecard fields
[13:12:02] ack, will take a look. I'm not sure there are any actionables for this; the main failure was that I self +2'ed
[13:12:26] looks to me like an actionable 0:-)
[13:12:35] jbond: I added that as something that went poorly
[13:12:51] e.g. "always get a review for apache changes" or something like that
[13:12:51] yes, which I think is fine; I don't see it as an actionable though
[13:13:25] we can add it, but I personally don't think it adds value to the report, i.e. nothing will get actioned from that, no task etc
[13:13:33] Yeah, I agree
[13:13:38] It is hard to enforce that
[13:13:41] reminder: not everything has to be a great automation fixing everything
[13:13:50] most of those never get done
[13:15:40] marostegui: thank you!
[14:27:29] thank you elukey !
[17:55:39] 2 rsync jobs, writing to the same file system on the remote side, same config for the fragments except the path. one of them works just fine.. the other claims the remote file system is read-only.. which it isn't.. wtf (again)
[18:01:23] hmmm
[18:01:38] I was pushing a manual DNS change, and authdns-update failed on some zone validator stuff unrelated to my change
[18:01:41] E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.b.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' for name 'irb-1116.cloudsw1-c8-eqiad.eqiad.wmnet.' and IP '2620:0:861:fe0b::1', PTRs are: 12.147.64.10.in-addr.arpa.
[18:01:46] E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.b.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' for name 'irb-1116.cloudsw1-d5-eqiad.eqiad.wmnet.' and IP '2620:0:861:fe0b::2', PTRs are: 13.147.64.10.in-addr.arpa.
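A side note on decoding E003: the expected PTR owner name is just the nibble-reversed form of the AAAA address under ip6.arpa. A minimal sketch using only the Python standard library (names and addresses taken from the errors above) shows how that name is derived:

    import ipaddress

    # Name/address pairs from the E003 errors above.
    records = {
        "irb-1116.cloudsw1-c8-eqiad.eqiad.wmnet.": "2620:0:861:fe0b::1",
        "irb-1116.cloudsw1-d5-eqiad.eqiad.wmnet.": "2620:0:861:fe0b::2",
    }

    for name, addr in records.items():
        # reverse_pointer expands the address and reverses its nibbles,
        # yielding the ip6.arpa owner name the validator expects a PTR at.
        rev = ipaddress.ip_address(addr).reverse_pointer
        print(f"{rev}. PTR {name}")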
[18:01:57] shouldn't CI or something have caught this earlier and not deployed it?
[18:02:21] digging for the causal change
[18:03:26] topranks: looks like you edited some related netbox data earlier today, I'm guessing that's the mechanism
[18:03:42] still, something should've prevented pushing a change that doesn't validate
[18:03:49] !log cp6006 in maint mode and depooled for memory troubleshooting via T309123
[18:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:55] T309123: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123
[18:04:01] bblack yep
[18:04:06] there is a patch here: https://gerrit.wikimedia.org/r/c/operations/dns/+/798893
[18:04:14] if you can review, I'll merge; that should fix it
[18:04:48] well, I should probably try to merge both, since my patch is now merged first and breaking the deploy
[18:05:13] ok yeah, that probably makes sense.
[18:06:32] still, there's some kind of process and/or validation issue here that it can happen at all
[18:08:35] all is well now
[18:09:07] Yeah, I'll discuss with volans tomorrow; previously it didn't matter what order these additions happened in
[18:09:34] ack
[18:09:38] With the zone validator script, we now need to make sure all 'includes' for files netbox may create are added to the repo and merged before the entries that will create those files are added in netbox.
[18:10:17] yeah, but if we merge (via repo + authdns-update) a pair of includes for files that don't yet exist in the netbox exports, basic DNS daemon validation will fail (it can't load the zonefiles with missing includes)
[18:10:29] yeah exactly, was just about to say
[18:11:06] maybe there's a way to hook up creation of empty includes on creation of the subnet/zone/whatever in netbox, and then glue up the dns part, before defining any IPs
[18:11:07] The answer may be that the zone validator fires a warning but continues rather than stops
[18:11:13] !log cp6006 memory issue resolved, returned system to service and ended maint window via T309123
[18:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:19] T309123: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123
[18:11:45] bblack: yes, not a bad idea. I guess the tricky part there is combining the auto-generated with the manually added parts of the zonefiles.
[18:12:19] yeah
[18:12:51] in some ideal world we'd just be split at the zone level: some zones are manual and some aren't. I imagine such a state isn't realistic, though.
[18:13:15] netbox doesn't feature-claim to want to be a complete dns management tool for any kind of whole zone anyway.
[18:13:57] in either case, no matter which way the procedure ends up working, there's also a validation mismatch
[18:14:36] as in: somehow netbox, exporting its stuff after the creation of those IPs, didn't fail some validation step that would have halted exporting further netbox updates towards authdns visibility.
[18:15:00] or alternatively, my patch could've failed validation in jenkins instead, preventing me from trying to merge (might be easier, even if it's not as direct)
[18:15:19] hmm yeah. I wonder if an approach might be that Netbox doesn't try to create the file if there is no include for it.
[18:15:31] it probably would have on a re-check, but I already had Jenkins +2 before the problem edits in netbox, and since there was no ops/dns merge since, I didn't need to rebase before merging away
[18:15:34] or yeah, similar logic in Jenkins
[18:16:27] there's a bunch of ways to hack something up to make it fail somewhere in this case, but a lot of tradeoffs in each
[18:17:07] yeah. luckily it's rare-ish that we need to add new include entries, but with the subnet-per-rack model it's more common than it used to be.
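A rough sketch of the "empty includes" idea floated above: walk the zonefiles, collect $INCLUDE targets, and pre-create any that netbox hasn't exported yet as empty files, so daemon-level validation can still load the zones. The repo layout and paths here are hypothetical; only the standard RFC 1035 $INCLUDE syntax is assumed:

    import re
    import sys
    from pathlib import Path

    # Hypothetical checkout: zonefiles under templates/, netbox snippets
    # referenced as "$INCLUDE netbox/<file>" relative to that directory.
    REPO = Path("/srv/git/dns/templates")
    INCLUDE_RE = re.compile(r"^\$INCLUDE\s+(\S+)", re.MULTILINE)

    for zonefile in sorted(REPO.iterdir()):
        if not zonefile.is_file():
            continue
        for target in INCLUDE_RE.findall(zonefile.read_text()):
            path = REPO / target
            if not path.exists():
                print(f"{zonefile.name}: creating empty include {target}",
                      file=sys.stderr)
                path.parent.mkdir(parents=True, exist_ok=True)
                path.touch()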
[18:44:47] * volans catching up with the backlog, bblack, topranks
[18:45:50] so first of all, this is a direct consequence of the fact that we're now also validating netbox data in the dns repo CI; at the same time this has allowed us to catch various inconsistencies in our dns repo, so there is a trade-off there
[18:46:27] a very quick "solution" could be to have the operations/dns repo do gate-and-submit, so that jenkins will do the submit only if CI is passing again after the +2
[18:47:13] another helper on the other side of the problem is:
[18:47:13] https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change
[18:47:56] but I'll first check what the failures were before saying anything more, I need more details :)
[18:47:57] would there still be a race condition with gate-and-submit, though, since they're separate repos?
[18:48:38] I guess so, but the timeframe would be a bunch of seconds I think
[18:48:42] yeah
[18:48:56] so the dns cookbook did fail
[18:48:56] cmooney@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:49:00] I'm checking the logs
[18:49:39] the cookbook fails after making a commit to the repo that will fail validation?
[18:50:01] utils/deploy-check.py -g /srv/git/netbox_dns_snippets --deploy returned exit code 1
[18:50:34] volans: I was afk there
[18:50:51] This is what happened when I tried to run the DNS cookbook to create entries for the new hostnames I added
[18:50:52] https://phabricator.wikimedia.org/P28460
[18:51:18] I hadn't merged the patch to add the 'include' to the zonefile at that stage
[18:51:38] right
[18:52:11] but still, the flow of the cookbook's validation is commit-then-check; I think that's at the root here (but I imagine it's that way around because doing otherwise would add a ton of complexity)
[18:52:12] Not realising it would now cause an error. I'm guessing that was about 20 mins before bblack pinged me here when he hit the error
[18:52:44] bblack: I was about to propose adding a check before committing
[18:53:12] we could run the easy one without gdnsd, just having a local copy of the dns repo on the cumin hosts
[18:53:20] sorry, the netbox hosts
[18:53:36] hmm yeah
[18:53:37] and we could run the deploy-check script inside that
[18:54:02] the other thing I'd like to know is if the procedure outlined in
[18:54:03] https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change
[18:54:10] would have worked and not broken things
[18:54:28] yes, it still has a race if others merge anything in the ops/dns repo
[18:54:44] and that's described there in the doc
[18:55:02] volans: sry, no, I didn't follow that process
[18:55:15] I'm not sure that process completely works in this case either. It does help though
[18:56:00] I did 1 and 2 in that order, but at step 3 I ran the cookbook as normal; I'm thinking if I'd run it with --skip-authdns-update it may have completed
[18:56:39] from your paste it would have, yes
[18:56:55] hmmm, I guess it would've succeeded actually, because the cookbook doesn't currently validate other than manually asking the user, and running the update, right?
[18:57:10] yes, in its current state
[18:57:14] but it still would've left us in a broken state
[18:57:23] repo-wise yes, DNS-wise no
[18:57:38] in that it would've committed, then failed, and then required some kind of netbox data rollback or ops/dns commit to fix before anyone can do normal updates to the repo again
[18:57:50] correct
[18:57:50] (without bypassing CI checks)
[18:58:00] that's what the disclaimer below says too :)
[18:58:06] * volans should make it bold
[18:58:12] or a warning block
[18:59:01] the pre-commit check you mentioned earlier would take a lot of risks out of this in general, though
[18:59:16] but yes, I guess that adding a CI run to the generate-dns script is a wise move
[18:59:28] I can imagine a scenario where someone walks away from a cookbook failure and someone else discovers it hours later while trying to do some outage-emergency dns patch
[18:59:38] but we'll need a way to skip it for when we explicitly want to perform the above-mentioned operation
[19:00:17] the procedure still should work with the pre-commit check, right?
[19:00:19] that's needed for example when moving a zonefile from manual to netbox-managed, where we need to atomically move the records from one repo to the other
[19:00:25] ah
[19:00:27] I think it would fail
[19:00:31] the CI check
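The pre-commit check under discussion could look roughly like this: run the same validator the deploy step uses against the freshly generated snippets, before the cookbook commits anything, with an explicit escape hatch for the move-a-zone-between-repos case. Everything here is a sketch: the function name and force semantics are invented, the deploy-check.py path and -g flag come from the failure output above, and it assumes the script validates without deploying when --deploy is omitted:

    import subprocess
    import sys

    def validate_generated_records(snippets="/srv/git/netbox_dns_snippets",
                                   force=False):
        """Run dns-repo validation against the generated snippets before
        committing them, so a bad export never lands in the repo."""
        if force:
            # Deliberate skip, e.g. when atomically moving records from
            # the manual repo to netbox (see the wikitech procedure).
            return
        result = subprocess.run(
            ["utils/deploy-check.py", "-g", snippets],  # no --deploy
            cwd="/srv/git/dns",  # hypothetical local checkout location
        )
        if result.returncode != 0:
            sys.exit("generated records fail validation; aborting "
                     "before commit (use force=True to override)")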
[19:00:59] I could also feature-update gdnsd to make some easier tradeoffs here
[19:01:07] two things I can think of:
[19:01:14] like ignore $INCLUDE with files that don't exist?
[19:01:41] 1) I could add a wildcard $INCLUDE statement for a whole directory of includes. That might complicate some things about the existing model of staging things, but still, it makes some other bits easier.
[19:02:01] 2) Or alternatively, yeah, have some kind of option to ignore non-existent includes
[19:02:16] both are a little dangerous in some sense, in that we'd have no gdnsd-level validation failure if an important file goes missing
[19:02:42] could make it optional at the per-$INCLUDE-statement level though, and make it part of a two-stage process
[19:03:11] something like $INCLUDE_IGNORE :D
[19:03:12] commit to ops/dns with $INCLUDE-OPTIONAL netbox/foo, push the new data, then commit again to swap to $INCLUDE netbox/foo
[19:03:15] or something of that nature
[19:04:11] not the cleanest workflow but it's straightforward enough I think
[19:04:25] with the cookbook failing though
[19:04:40] so to block the broken change
[19:04:55] ok, I'll look into adding the "CI" check to the cookbook
[19:05:00] if the cookbook fails pre-validation, it won't push the commit, the file never exists, and the second ops/dns commit won't validate either
[19:05:22] yep
[21:19:45] Hello folks, I have a quick question: could anyone share an example of an alerting setup with prometheus? I'm learning the ropes of prometheus for a new project :)
[21:19:54] ty!
[21:40:52] maryyang: I'm not sure if it's exactly what you're looking for, but this is a great place to start: https://wikitech.wikimedia.org/wiki/Alertmanager
[21:52:58] note: don't try to rsync files directly into /etc/ on a remote server. you can use /foo or /srv/whatever.. but not /etc; it will fail
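To complement the Alertmanager link above, a minimal self-contained alerting rule in the upstream Prometheus rules-file format; the recording-rule name and threshold are invented for illustration, and on this infrastructure the firing alert would then be routed by Alertmanager as described on the linked page:

    # rules/example.yaml -- referenced from rule_files: in prometheus.yml
    groups:
      - name: example
        rules:
          - alert: HighErrorRate
            # Fire after the 5m error ratio stays above 5% for 10 minutes.
            expr: job:request_errors:rate5m > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Error rate above 5% on {{ $labels.instance }}"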