[06:22:16] <_joe_> btullis, klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/844445 broke building virtually any production image as you've substituted the default mappings with that one, so no mapping can be found for e.g. "www-data".
[06:22:58] <_joe_> fixing that :)
[07:45:39] <_joe_> is it me or logstash returns empty results to any query?
[07:46:38] <_joe_> well at least for anything on kubernetes
[07:46:51] <_joe_> jayme: did we change anything?
[07:48:51] <_joe_> ok now they're back, weird
[07:50:02] yeah I was about to ask which ones didn't show up
[07:50:17] <_joe_> actually nothing did
[07:56:34] we did not change anything apart from the fact that I've updated mmkubernetes to no longer have to restart rsyslog every now and then
[07:57:30] <_joe_> jayme: ok that's what broke everything then :P
[07:58:05] nono, it was happily shipping logs. this is clearly your fault :D
[08:11:50] I'd argue that this is a serviceops problem, so I'd assign the fault to your manager :D
[08:15:31] <_joe_> agreed
[08:37:08] which VM is responsible for running the puppet compiler for WMCS instances?
[08:37:37] <_joe_> it's a few I think, what was the url for your pcc catalog?
[08:38:07] yeah.. found it
[08:38:08] vgutierrez: there's several VMs with acme chief yep
[08:38:15] dcaro: what?
[08:38:32] aaahhh pupppet compiled xd, nevermind misread
[08:38:44] *compiler
[08:38:56] for some reason I read cert compiler
[08:41:39] * vgutierrez sends some coffee to dcaro
[08:45:12] so.. yesterday I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/849121 to get the deployment-puppetmaster04 pub key updated
[08:45:37] but I'm still getting ERROR:root:request denied, ensure you have added the puppetmaster certificate https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud
[08:45:37] when running ssh deployment-puppetmaster04.deployment-prep.eqiad1.wikimedia.cloud sudo /usr/local/sbin/puppet-facts-upload
[08:46:33] _joe_: sincere apologies for breaking the production-images build. How did you fix it?
[08:46:57] <_joe_> btullis: just added back all the mappings we have as default configuration in the software itself
[08:47:12] and role::puppetmaster::standalone::upload_facts: true is indeed applied for deployment-puppetmaster04
[08:47:32] vgutierrez: I'll take a look
[08:47:37] jbond: thx <3
[09:08:09] vgutierrez: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/849484 I'm not sure why but the pubkey and the public cert didn't match. I have copied the old key to $(sudo facter -p puppet_config.hostpubkey).bak and dumped the correct key with
[09:08:13] openssl x509 -in $(sudo facter -p puppet_config.hostcert) -noout -pubkey | sudo tee $(sudo facter -p puppet_config.hostpubkey)
[09:08:55] I have uploaded the facts and am running the processor now
[09:20:23] jbond: 😓
[09:20:45] thx
[09:21:50] np
[09:59:06] jbond: BTW, found a small bug regarding pcc ability to mark a host as NOOP
[09:59:12] https://puppet-compiler.wmflabs.org/pcc-worker1001/37765/
[09:59:30] NOOP hosts for that PCC run are the deployment-cache ones
[10:14:33] vgutierrez: ack thanks, we see this very sporadically every now and again but I haven't had any time to investigate, task is https://phabricator.wikimedia.org/T224977
[10:15:02] LOL I've commented in that task 3 years ago
[10:15:11] my memory isn't as good as I think
[10:49:49] :)
[11:14:10] vgutierrez: should have paid for ECC ;-)
[11:14:51] Emperor: beer overpowers ECC in the brain sadly
[13:23:13] apergos: are we ready for me to shut down labstore100[67]? I see that some dumps are still not working perfectly but I'm not sure those hosts have anything to contribute at this point...
[13:23:26] (also, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/849193 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/849192)
[13:23:47] andrewbogott: just now reviewing the uh
[13:24:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/849192/1 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/849193
[13:24:10] thx
[13:24:34] just working our way slowly through the daily triage :-)
[13:27:06] I probably did it *shrug*
[13:28:11] hm, wrong channel
[15:00:20] every time that I need to write the FQDN of a cloud instance I feel like I'm in GoT: Daenerys Stormborn of House Targaryen, the First of Her Name, Queen of the Andals and the First Men, Protector of the Seven Kingdoms, the Mother of Dragons, the Khaleesi of the Great Grass Sea, the Unburnt, the Breaker of Chains
[15:00:41] o_O
[15:00:53] deployment-cache-upload07.deployment-prep.eqiad1.wikimedia.cloud isn't short at all
[15:01:37] 😂
[15:04:27] vgutierrez: it made my day thanks
[15:07:39] You could try to think of it as: Seventh of Many of the Uploaders of Cache, Deployer of Prepared Deployments, in the First House of Equinix of Ashworth, of the Realm of Wikimedia, Kingdom of the Cloud
[15:10:44] do you think it'd be possible to increase memory of lists1001 (VM)? it's 6GB
[15:11:07] Amir1: yes, that's nothing
[15:11:11] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lists1001&var-datasource=thanos&var-cluster=misc&from=now-3h&to=now&viewPanel=18
[15:11:16] it's swapping it seems
[15:11:29] Amir1: https://wikitech.wikimedia.org/wiki/Ganeti#Resize_a_VM
[15:11:52] awesome
[15:16:34] I guess it would help to put eqiad1.wikimedia.cloud into the dns search path on the cloud bastion
[15:17:41] in general it's possible to configure ssh to tab complete on projects with puppetdb, for example I have https://git.sr.ht/~taavi/dotfiles/tree/master/item/bin/wmf-update-known-hosts-wmcs#L13
[15:25:47] <_joe_> vgutierrez: https://bash.toolforge.org/quip/VLjmFIQB6FQ6iqKiF1Nx
[15:28:26] Ahahha fair enough
[15:29:12] cdanis: Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; restarting from within the instance will not be enough.
[15:31:00] let me see when I can get the vm restarted
[15:32:41] now that you are speaking of ganeti, I think I saw a ganeti host having higher memory pressure than usual
[15:32:58] Amir1: here is a cookbook for that ;)
[15:33:20] vgutierrez: Of course the British royalty titles are even worse than that. https://en.wikipedia.org/wiki/List_of_titles_and_honours_of_William,_Prince_of_Wales#Royal_and_noble_titles_and_styles
[15:33:25] nice
[15:33:33] not necessarily a bad thing, but it is there because in the past it was an indicator of performance issues: "ganeti1018 WARN Memory 93% used. Largest process: qemu-system-x86 (30709) = 12.7%"
[15:34:29] starting from yesterday, maybe an issue due to imbalance of vms due to reboots?
[15:34:52] Amir1: elsewhere on the wiki page :)
[15:35:29] https://wikitech.wikimedia.org/wiki/Ganeti#Shutdown/startup_a_VM
[15:37:27] <_joe_> volans, XioNoX I see that netbox is marked pooled in both datacenters right now, but it's marked active_active: false in service::catalog
[15:37:50] yeah, check https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=ganeti&var-instance=All&var-datasource=thanos&from=now-2d&to=now (memory section) there may be an inbalance on vm distribution
[15:37:54] <_joe_> so that is making the compilation of the state file fail
[15:38:06] *imbalance
[15:38:52] Amir1, did the latency issues start yesterday or have they been happening for longer?
[15:38:55] jbond: ^^^
[15:39:03] for longer to my knowledge
[15:39:09] it has been going on for a while now
[15:39:10] ah, so this is a separate issue
[15:39:22] intermittent
[15:39:44] <_joe_> volans: if you tell me which dc should be active, I can fix it
[15:40:18] _joe_: eqiad should be active, I think this was an unwanted by-product of the reboots
[15:40:49] <_joe_> volans: ok, fixing
[15:41:37] the increase of memory for lists was way overdue, the amount was set when we were only on mm2. The new service has many things like search, etc. It's not really working for it
[15:42:21] I believe my ganeti warning will fix itself when those hosts are roll-restarted
[15:42:23] [13/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for lists1001.wikimedia.org
[15:42:35] at which point I should panic? :P
[15:42:42] Amir1: that's not the cookbook you were supposed to do
[15:42:43] *use
[15:42:44] jobo: volans: ack fixing now, guess there is still a bug in the cookbook :)
[15:42:56] volans: I used reboot single vm
[15:43:09] sudo cookbook sre.ganeti.reboot-vm lists1001.wikimedia.org
[15:43:12] ah ok
[15:43:17] then it's the correct one :)
[15:43:18] bit vms for me tend to restart very quickly
[15:43:18] sorry
[15:43:21] *bit
[15:43:27] *but
[15:43:59] yeah, 2 minutes for a vm seems a bit excessive
[15:44:05] 2 minutes and counting
[15:44:15] 3
[15:44:53] I will panic after five
[15:45:34] Amir1: check with moritz but I see it took quite some time in some past cases: https://sal.toolforge.org/production?p=0&q=sre.ganeti.reboot-vm&d=
[15:46:29] wow
[15:46:47] if I knew, I would have announced it
[15:46:57] it's emergency anyway
[15:47:00] you also might have just hit the bug where networking doesn't come back up correctly
[15:47:15] Amir1: are you able to view the console of the host? https://wikitech.wikimedia.org/wiki/Ganeti#Get_a_console_for_a_VM
[15:47:54] doesn't look like it
[15:49:10] it's stuck in console
[15:49:18] (nothing is showing up)
[15:49:45] ^L get anything?
[15:50:58] ladsgroup@ganeti1027:~$ sudo gnt-instance console lists1001.wikimedia.org
[15:51:09] or try rebooting again and attaching the console as soon as possible, I wish you could poll for the console connection
[16:04:00] there is some weirdness on job logs with multiple network errors, but before maintenance
[16:05:12] Amir1: still stuck? need a hand?
[16:06:20] I'm talking to Moritz on this, I don't want to bother too many people
[16:06:40] network errors started yesterday at 8:15
[16:07:59] ok!
[16:09:35] cdanis: do you mind me asking you for some pointers regarding work done on Victorops later this week? (I think it was you?)
[16:09:49] jynus: sure, no problem
[16:10:12] not sure if you're asking about the ICS file export or the 'escalator' function for fixing batphone delay, but either way, yes
[16:11:02] yeah, both, and if additionally you have some general thoughts, too
[16:11:22] I want to understand the work done to document it on needs/flaws
[16:19:35] effie: the llamas in your SREcon slides are awesome :)
[16:22:32] about to disappear, but please call if you need more hands re:ganeti
[18:05:38] cdanis: fwiw and according to Moritz, it seems the cookbook doesn't go well when you resize a VM, you have to manually shut it down and restart it
[18:05:53] (fixed for a while now but that was the root cause)
[18:05:55] hmmmm
[18:06:10] perhaps the cookbook needs to be modified in some way then :)
[18:06:19] > I'll fix the docs to specifically mention shutdown/startup, sre.ganeti.reboot-vm is meant to restart the qemu process (what we need if we need to apply e.g. a qemu update or a ganeti config setting change), I suppose the command it triggers (sudo gnt-instance reboot VMNAME) can't cope with changed hardware? not sure
[18:06:22] possibly
[18:06:32] still not 100% sure though
[18:06:48] interesting
[18:26:02] I'll bump the prio/work on https://phabricator.wikimedia.org/T219454 in the next months, this should just be a cookbook to run and not manual commands which are error-prone
[22:20:58] bd808: we <3 llamas !
[22:23:01] effie: I have a pink stuffed llama-corn in my all too large collection of unicorn things :)
[22:24:07] oh Wow, please email me a picture when you can
[22:29:11] effie: photos in the either somewhere between my phone and your gmail account
[22:29:22] *ether
[22:53:25] received, made me smile, thank you!
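
Note on the "[13/240, retrying in 10.00s]" line pasted at 15:42:23: it reads like the output of a fixed-delay retry loop that polls the VM until it answers or the attempts run out (240 tries at 10s is a 40 minute ceiling). The sketch below shows that general pattern only. It is not Spicerack's actual code; the name retry_fixed_delay and its parameters are hypothetical and chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_fixed_delay(
    func: Callable[[], T],
    tries: int = 240,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> T:
    """Call func until it succeeds, waiting a fixed delay between attempts.

    Hypothetical helper (not the Spicerack API). Re-raises the last
    exception once all attempts are used up.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == tries:
                # Out of attempts: let the caller see the real failure.
                raise
            name = getattr(func, "__name__", repr(func))
            log(
                f"[{attempt}/{tries}, retrying in {delay:.2f}s] "
                f"Attempt to run '{name}' raised: {exc}"
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing sleep and log as arguments keeps the loop easy to exercise without real waiting; in normal use the defaults (time.sleep and print) apply.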