[06:36:43] FYI, the primary link to DRMRS is down, and while its backup is finally up, it's waiting on the provider to fix the MTU on their side, so some checks are not going through
[07:25:50] that's most likely what causes the debmonitor email spam
[07:28:51] ah thanks, I had been wondering whether to report that this morning
[08:15:32] XioNoX: yes, the debmonitor spam is because it's a daily cron/timer and couldn't connect to debmonitor, so generated output that AIUI was later delivered as email
[08:15:38] and all failures are for drmrs hosts
[08:16:17] alright
[08:16:45] GTT is working on the MTU and Telxius is working on the connectivity issue (see emails to noc and maint-announce)
[08:18:03] yep, saw them
[08:18:16] are we "just" getting unlucky?
[08:18:26] connectivity at drmrs hasn't been the best so far ;)
[08:19:55] well, I'd say we're lucky that GTT is finally up :)
[09:07:59] Heh. Also very lucky these niggles happened before the site was handling live traffic.
[09:29:16] <_joe_> it's not uncommon to have such issues when a site is set up
[09:29:27] <_joe_> I remember eqsin having similar issues at first
[09:31:28] FYI the sre.dns.netbox cookbook is currently failing on the drmrs hosts because they can't reach netbox-exports.wikimedia.org/dns.git to update their netbox-generated data
[10:19:01] Hello, there seems to be an issue with deployment mediawiki as reported in task https://phabricator.wikimedia.org/T302699; on the cache host the connection to `deployment-mediawiki11` is intermittent
[10:38:16] who handles deployment-mediawiki instances?
[10:45:01] vgutierrez: officially no-one
[10:45:35] (https://phabricator.wikimedia.org/T215217 is over 3 years old at this point)
[10:50:57] thx
[12:24:52] <_joe_> mmandere: jayme or I will eventually take a look during the day
[12:29:30] <_joe_> I see everything is ok now, I guess there was database slowness from what I see in the slowlog
[12:29:49] _joe_: apache was struggling there and I restarted it
[12:30:51] <_joe_> vgutierrez: that did nothing for the problem I think, the problem was actually that the parsoid appserver was struggling
[12:31:01] <_joe_> I got confused with utc times
[12:31:10] ack
[12:33:49] <_joe_> apparently mediawiki calls restbase that calls parsoid which is overall... mediawiki
[12:37:31] _joe_: ack, thank you :)
[14:57:53] volans: so due to the DNS situation in Marseille a VM creation I just tried failed. Given that Luca and I also realized I got the name wrong, I was about to delete it using the decom cookbook. Is that the right thing to do in this case (name is ml-etcd-staging2001)
[14:59:22] klausman: depends, I'm jumping into a meeting, can explain in a few minutes
[14:59:32] ok, sure
[14:59:45] but also, you ran it from the wrong host (cumin2001). moritzm didn't we remove spicerack from there more than once? why is it still there?
[15:00:05] How is 2001 wrong?
[15:00:20] klausman: cumin2002 is the correct host for codfw,
[15:00:21] 1001/2002 are active
[15:00:27] 2001 is only there for DBA needs
[15:00:43] and it should not even work
[15:00:47] not sure why it is, that's wrong
[15:01:00] and must be fixed
[15:04:44] Also maybe a big fat warning in MOTD?
[15:06:07] T276589 for context, I'll get back to you with the rest of the answers in a few mins, this meeting will be short
[15:06:08] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589
[15:06:20] Sure, TIA
[15:36:25] klausman: I'm back
[15:37:10] wb :)
[15:37:39] so, given that I'm not sure at this point what code was run, I will check
[15:37:51] yes, if you need to rename it anyway it's better to call the decom cookbook
[15:38:09] that said, the dns.netbox run in it will fail due to the drmrs networking issues
[15:39:30] Is there a way to make it keep going?
[15:40:09] yes, but I need to send a patch
[15:46:16] btw I just manually edited /usr/bin/cookbook on cumin2001 to prevent execution
[15:46:41] So I should go ahead and do my decom run?
[15:47:19] give me 5 minutes
[15:47:25] the decom has a call to the dns too
[15:47:27] Sure.
[15:52:28] klausman: actually you can run the decom, it would report the failure of the dns run but will continue with the additional steps
[15:52:39] Ok, will do
[15:54:27] Mh. Is this expected: https://phabricator.wikimedia.org/P21602 ?
[15:56:34] klausman: I don't see ml-etcd-staging2001 anywhere in netbox
[15:56:43] but there are just the 2 assigned IPs with related DNS name
[15:56:45] It was there earlier (IPs)
[15:57:22] and searching for `ml-etcd-staging2001` still finds them
[15:57:36] yes, but no VM is registered in netbox
[15:57:43] let me check the changelog
[15:57:47] Ok, so I just delete the IPs and that's it?
[15:58:05] (well, both A/AAAA and PTR records)
[15:58:40] we surely need to delete the IPs and then run the sre.dns.netbox cookbook to propagate that change
[15:58:53] question, was the VM created in netbox?
[15:58:57] sorry, in Ganeti
[15:59:38] I don't think so, let me check
[16:00:26] if just that then yes, delete the IPs from netbox (if you need help let me know), and then run the sre.dns.netbox cookbook
[16:00:28] It's not listed with sudo gnt-instance list|grep etcd
[16:00:39] I've sent the patch for the makevm one
[16:00:40] (on ganeti2021)
[16:00:55] ack
[16:05:12] So should I delete the netbox-side IPs and run the netbox cookbook, or wait?
[16:06:30] go ahead with those 2 actions
[16:06:39] and then review the patch I sent :)
[16:06:41] roger
[16:06:45] thanks
[16:13:59] Both complete (with errors for 6001 and 6002, as expected, cookbook stopped)
[16:14:28] ack, that's the last step of the cookbook, so all good
[16:14:30] https://phabricator.wikimedia.org/P21603
[16:14:36] for what concerns your change
[16:14:37] (just for completeness)
[16:14:50] thanks, yep confirmed
[16:14:56] Thanks for your help. About creating VMs, is that now safe to do?
[16:15:09] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/766785
[16:15:20] if you can do a quick review I'll merge and deploy
[16:16:06] that seems a generally valid use-case fix to me, I didn't want to hardcode dns6* to exclude them in the dns cookbook
[16:16:10] k
[16:18:00] LGTM
[16:22:18] * volans waiting on jenkins
[16:23:47] klausman: change deployed, go ahead with the makevm cookbook, lmk if you have any issue
[16:24:17] Alrighty, thanks again
[16:25:15] no prob, and sorry for the trouble
[16:27:38] No worries
[16:36:32] Hrmmmm. Now the ganeti DNS lookup(?!) failed and a second DNS change asks me to delete the just-created names for the new VM
[16:36:57] (plus changes for an-worker1147 and wmf5120)
[16:37:05] Those are management IPs, tho
[16:42:15] klausman: when the makevm cookbook fails it does roll back the netbox and dns changes
[16:42:22] so that's expected
[16:42:38] Alright.
[16:42:53] the change for an-worker1147 and wmf5120 is probably topranks
[16:43:00] and you have stepped on each other's toes
[16:43:18] let me look at the logs to see what actually failed for you
[16:43:25] Well, I let it go through since my rollback was benign, and the other changes looked okayish
[16:45:09] do you have the ganeti failure?
[16:45:23] Sorry yes
[16:45:30] sec
[16:45:37] Should be fixed though
[16:45:48] My apologies Tobias
[16:46:02] No worries
[16:48:21] https://phabricator.wikimedia.org/P21604 The Ganeti fail
[16:48:43] `The given name (ml-staging-etcd2001.codfw.wmnet) does not resolve: Name or service not known`
[16:48:50] Maybe a timing issue?
[16:49:00] yeah dns issue
[16:49:26] I'm running the dns cookbook again now, the IP duplication I caused is fixed up in Netbox.
[16:49:33] Roger
[16:49:35] interesting, never seen that happening klausman
[16:49:45] was the name previously used?
[16:49:47] Aren't I lucky :)
[16:49:51] could have been a negative cache
[16:50:45] I can give it another go, maybe I am luckier this time
[16:51:19] give it a minute there, my name isn't resolving either
[16:53:06] do you know whom I could ask some general questions about the puppet setup of a new datacenter (not related to traffic, network or hw)?
[16:53:26] topranks: sure, lmk if/when I should run random commands ;)
[16:55:23] jynus: all changes for drmrs have been tracked by T282787, so that could be a start. Jo.hn is out today/tomorrow btw.
[16:55:23] T282787: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787
[16:55:24] volans: not sure if I broke something somehow, maybe you might be able to assist.
[16:55:29] topranks: shoot
[16:55:44] volans: thank you, that is exactly what I needed- a pointer
[16:56:16] jynus: do you have some specific question I can answer async?
[16:56:29] So the DNS cookbook tells me there are no changes needed when I run it now:
[16:56:38] https://phabricator.wikimedia.org/P21605
[16:56:55] yes
[16:56:55] I am getting some backup errors from the new dc- I am assuming it is for WIP setup, but wanted to confirm/ignore them for some time
[16:57:06] ^ bblack
[16:57:10] jynus: drmrs has network issues
[16:57:11] jynus: what gets backed up from there?
[16:57:13] not super important
[16:57:13] But it doesn't create this name: https://netbox.wikimedia.org/ipam/ip-addresses/10384/
[16:57:14] we had some network issues
[16:57:14] a lot of things are broken
[16:57:20] ah, ok
[16:57:49] jynus: I'm just curious incidentally, since we mostly consider those sites to be stateless from our POV
[16:57:58] there might be edge cases, though!
[16:58:16] bblack: was that intentional
[16:58:23] eyyyy
[16:58:34] ?
[16:58:51] edge cases I guess :)
[16:58:51] oh, no, it wasn't
[16:58:55] heh
[16:59:13] bblack: one sec I get full logs
[16:59:34] bast6001.wikimedia.org-Monthly-1st-Fri-production-home and install6001.wikimedia.org-Monthly-1st-Sun-production-srv-tftpboot
[17:00:04] knowing there were network issues those failures would be expected
[17:00:29] yeah I just wanted to have some idea what's being exported. seems pretty minimal
[17:00:55] I don't know what the rationales are for caring about those directories in these cases at all, but they're not unreasonable either :)
[17:00:57] and sorry to answer the obvious, those issues are still ongoing?
[17:01:02] jynus: yes
[17:01:07] ok, so getting out of the way
[17:02:16] (but feel happy we have monitoring for backups that works)
[17:02:21] :-)
[17:06:56] klausman: I have the explanation for your ganeti failure, my bad, see -dcops for context (last few lines from me)
[17:07:42] dcops?
I don't think I'm on that channel :)
[18:55:55] klausman: Finished creating my datahubsearch1002 vm
[19:12:20] I have another vm to create, so I'm going to proceed with that shortly
[19:26:17] razzi: quick one, did you get some DNS changes when you ran the cookbook a short time ago for "lsw" devices in eqiad?
[19:26:27] think we both went to run it around the same time so I just cancelled mine
[19:26:32] Ok yeah I did topranks
[19:26:47] great thanks :)
[19:26:58] some lines like `-irb-1038.lsw1-f4-eqiad 1H IN A 10.64.137.1`
[19:27:14] yeah that's them. ok cool no need to re-run it.
[19:27:31] ok thanks for checking in, I'll proceed
[20:09:09] If you have small personal git repos somewhere (github, gerrit, ..) and are interested in trying gitlab, importing from any https git URL is open for self-service. They will be under a personal namespace though, so your username in URIs. If you want to import under something less personalized you can hit me up to do it. Just got the privs for that.
[22:20:57] anyone know if our systemd custom resource (at least, I think that's what it's called) is the recommended way to handle systemd in puppet anymore? It looks like most of its use is on my team (Search) and I know we haven't updated our puppet code in a while ;) https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/service.pp
[22:23:11] inflatador: yep, that's the best way to define a systemd service
[22:23:39] `git grep systemd::service` in the puppet repo shows plenty of other uses too, if you're looking for examples to crib from
[22:24:53] Thanks rzl, was just wondering why our stuff is dropping units in /lib/systemd/system instead of /etc/systemd/system .. am I correct to assume it's because $override is not set?
[22:26:29] oh yeah, there's way more examples with your 'git grep'.
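For readers following along, the override behaviour being asked about could look something like this in a manifest. This is a minimal sketch, assuming the `systemd::service` interface from the linked WMF module (`content`, `override`, `restart` parameters); the unit name and file path are hypothetical, not taken from production:

```puppet
# Hypothetical unit managed via the WMF systemd::service define.
# Per the discussion above: without $override the full unit file is
# dropped in /lib/systemd/system; with override => true the content
# is installed as a drop-in under /etc/systemd/system instead,
# leaving the vendor unit untouched.
systemd::service { 'my-daemon':
    ensure   => present,
    content  => file('profile/my_daemon/my-daemon.service'),
    override => true,   # drop-in in /etc, vendor unit stays in /lib
    restart  => true,   # restart the service when the unit changes
}
```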
I guess we're the only ones calling with "Systemd::Unit"
[22:27:14] aha, yes
[22:27:37] systemd::service includes a systemd::unit and I think is preferred over using systemd::unit directly, but I couldn't swear to it
[22:29:59] That's fine, thanks for the context. I'm still a puppet n00b, trying to figure out what's what over here ;)
[22:30:50] if you ever find out, I'm sure we'd all love to know :D
[22:34:22] ^_^
[23:15:49] inflatador: Using systemd::service is indeed recommended because it handles the unit file but also the puppet service state and possibly monitoring. it's a unit plus a wrapper around it to do it properly/standardize it
[23:18:55] if you want periodic jobs / to replace crons, then see systemd::timer::job, which does it all: a service and a timer to start that service
[23:42:25] mutante thanks, been talking this over with ryankemper and I think we use Systemd::Unit to reduce the chance of the ES services getting restarted? At least that's what the comment says here https://github.com/wikimedia/puppet/blob/e35198eedefeadc00daabc101609d2a901335a24/modules/elasticsearch/manifests/instance.pp#L309
[23:46:51] inflatador: hmm, ACK! though just "service" is a bit different from "systemd::service" as well. systemd::service does have $restart true/false and then there is $service_params where you can pass through more custom parameters to the unit.
[23:47:15] inflatador: mutante: well specifically we use `service`, which is puppet's service abstraction, instead of directly using `systemd::service`.
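The `systemd::timer::job` define mentioned above bundles a oneshot service with a timer that triggers it. A sketch of how such a cron replacement might look, assuming the parameter names seen in the WMF puppet repo (`description`, `command`, `user`, and an `interval` hash with an `OnCalendar` spec); the job itself is invented for illustration:

```puppet
# Hypothetical periodic job replacing a cron entry: the define
# generates both a oneshot .service and a .timer that starts it.
systemd::timer::job { 'cleanup-tmp-reports':
    ensure      => present,
    description => 'Purge generated reports older than 7 days',
    command     => '/usr/bin/find /srv/reports -mtime +7 -delete',
    user        => 'root',
    interval    => {
        'start'    => 'OnCalendar',
        'interval' => '*-*-* 03:00:00',  # systemd calendar spec: daily at 03:00
    },
}
```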
and then we `require` the systemd::unit, which ...I assume gets created by the `service` resource, but then I'm confused why the require would be there
[23:47:59] mutante: wrote the above before reading your response, but yeah you're spot on there with service vs systemd::service
[23:49:40] to me "service" seemed more useful in the past when we had init services that were not systemd yet, because it could abstract away both, afair
[23:49:57] but now that everything is systemd I would say we probably want just our own systemd::service
[23:50:52] not sure about the require for the unit, but yes systemd::service or service would both create a unit
[23:51:43] and you could use service_params to do "custom" things if you don't want to be limited. that is, if just "restart => false" is not enough and you want to mask things or something
[23:52:59] wrt historically having non-systemd init services, that makes sense
[23:53:10] mutante: wrt the following
[23:53:12] > but now that everything is systemd I would say we probably want just our own systemd::service
[23:53:40] did you mean that we (search) probably now just want to use `systemd::service` directly instead of `service`
[23:53:47] unless we expect something to replace systemd, and that we'd still be using puppet at that time and puppet upstream would handle it for us, but I don't think so :)
[23:54:01] or alternatively did you mean that we (wmf) probably should have our own wmf abstraction of `systemd::service` like how we do periodic jobs etc
[23:55:53] ryankemper: I meant the first, that systemd::service is already our own standard.
[23:55:59] see this commit message https://gerrit.wikimedia.org/r/c/operations/puppet/+/365900
[23:56:28] so it's already the wmf abstraction and therefore I would recommend it for you as well
[23:57:03] because the puppet upstream stuff is "mostly a (complicated)
[23:57:28] mutante: oh I didn't realize `systemd::service` was our own abstraction, thought it was actually from systemd. and thanks for the link to the commit message, very helpful [23:57:54] yep :) yw [23:58:07] agreed, we get no benefit from the extra complexity of `service`, would be a good idea for us to get off it [23:58:38] probably right after we fix https://phabricator.wikimedia.org/T276198#7736542 would be a logical time to refactor that, so hopefully quite soon [23:59:58] sounds good. yea, "while at it" but not like you have to make it a priority to replace all asap. just if you have actual "cron" puppet resources left because that blocks puppet 6