[04:49:58] <_joe_> I've worked with trivago in the past. I wouldn't use them or worse their SRE team as an example of anything if not things to avoid doing. Maybe they got their act together in the meantime
[04:50:29] <_joe_> "SRE", not sure they're even called that
[08:23:26] <_joe_> say I have a label attached to a prometheus metric that reads A.B.C, and I'd like only "A.B" to appear in the legend
[08:23:43] <_joe_> is there a way to do regex substitution / similar things in grafana?
[08:29:24] _joe_: I think you can use https://grafana.com/docs/grafana/latest/variables/filter-variables-with-regex/
[08:30:09] <_joe_> volans: no, i need "a.b.c" when doing queries
[08:30:22] <_joe_> but visually no one cares about c :P
[08:30:24] yes but I think you can do substitution after that
[08:30:35] <_joe_> it's a label, not a variable
[08:31:22] <_joe_> it seems to apply to labels as well from that page, but it's not clear to me how :P
[08:33:04] <_joe_> I might get away with creating a chained variable, sigh
[08:43:27] :/
[08:43:33] I hoped it would be easier
[10:40:38] are the units on https://grafana.wikimedia.org/d/f64mmDzMz/power-usage?var-site=drmrs&orgId=1 correct? should the top bar in fact be kWh? it would seem a bit unlikely that drmrs is using 218kW...?
[10:46:25] it says "Totals for selected time window" and uses the sum_over_time function AFAICT
[10:46:44] sum of the 1h average also...
[10:57:17] <_joe_> uhm
[10:59:26] <_joe_> the units are kW not kWh, but maybe we're summing kWh instead of averaging?
[11:00:33] I don't know what the original metric reports
[11:01:09] <_joe_> I doubt it reports total power consumed
[11:02:05] same
[11:04:32] <_joe_> "The active power consumption of the input feed phase. A non-
[11:04:34] <_joe_> negative value indicates the active power consumption in
[11:04:36] <_joe_> Watts. A negative value indicates that the active power
[11:04:38] <_joe_> consumption was not available."
[11:07:05] <_joe_> so yes, the measure is in watts
[11:08:44] <_joe_> that number above is the total power used over the last two days, in kWh
[11:10:12] <_joe_> which means that the average power consumption of drmrs is ~ 4.5 kWh
[11:10:25] <_joe_> err kW
[11:10:46] <_joe_> Emperor: I hope I made matters more confusing :)
[11:11:21] <_joe_> but TL;DR: the correct unit for the number above is kWh
[11:18:22] I thought that might be the case, but it is quite confusing :)
[11:42:40] wonder if Grafana can be made to label correctly
[12:38:03] _joe_: grafana has a box for formatting the legend with text.template
[13:39:51] <_joe_> cdanis: thanks, I did figure that out in the meantime; sadly, it also turned out that I will have to patch prometheus-php-fpm-exporter. So i have bigger problems now :P
[13:40:00] ahahah
[13:40:09] it can't be that bad
[13:40:41] <_joe_> (narrator: it was)
[13:42:15] <_joe_> cdanis: no, there's actually an initial patch by the developer that is already going in the right direction
[13:47:54] _joe_: dare I ask what you need to change?
[13:48:43] <_joe_> cdanis: https://phabricator.wikimedia.org/T312634#8087055
[13:49:10] ahhhh
[13:49:12] <_joe_> basically the functionality I'd need is implemented in https://github.com/bakins/php-fpm-exporter/pull/32
[13:54:41] I'm going to do some testing on sretest1001
[14:36:10] topranks: have time to help me understand what's happening with https://phabricator.wikimedia.org/T305194 and https://phabricator.wikimedia.org/T299574? I haven't got as far as having a theory beyond "they don't work"
[14:40:38] andrewbogott: sure let me have a quick look
[14:40:49] thank you!
[14:45:14] Looking at cloudvirt1051.eqiad.wmnet I don't see any network problem
[14:45:47] Interface eno2np1 is connected to cloudsw1-e4-eqiad xe-0/0/35 on the right Vlan
[14:46:16] If I do a TCPdump I see traffic from 172.16.x addresses matching the cloud instances network
[14:46:42] hm
[14:46:55] let me check the VM again...
[14:47:01] let me have a look at a working cloudvirt host to see how the networking is set up normally on them
[14:47:18] thanks. Yesterday I was doing side-by-side comparison of 1047 and 1048
[14:47:54] And specifically canary1047-01 and canary1048-01
[14:48:04] I can get you logins on those VMs too if you don't see anything at the HV layer
[14:48:21] cool yep that might help, just looking on the physical host right now
[14:54:21] So looking on cloudvirt1048 "virsh" shows one VM on it, name 'i-00067765', tap interface 'tap54ee09a2-92'
[14:54:29] it doesn't appear to be trying to send any traffic
[14:54:46] is that canary1048-1 possibly? a shell on it might help
[14:56:32] topranks: yes, and I just started a second canary which has the same problem
[14:57:05] To get a shell, log into cloudvirt1048 and then 'sudo virsh console 5f5b1888-39c7-48aa-a86c-a306a0e4cff8'
[14:57:33] Ok. It looks like it's failing to get an IP from DHCP
[14:57:43] Do the cloudnet hosts provide the DHCP function on that network?
[15:00:02] yes
[15:00:38] yes, and I confirmed that the dhcp connection is working on other hypervisors
[15:00:44] (well, last night I did)
[15:00:58] I created a new VM earlier today and it worked fine
[15:02:25] thanks, might be something odd on the switch. I can see DHCP requests getting to cloudnet1003 and responses being sent
[15:03:19] lmk if there's anything I can do to help test
[15:03:40] It's possible this is a config issue on my end but standing up these new hosts is pretty routine these days
[15:07:53] The switch-side interfaces aren't set up correctly I believe.
[15:08:13] eno2np1.1105 is a vlan-based interface - so the switch should be set up to tag packets for that to work, which it is not
[15:08:31] If you look here it's just set with one vlan untagged:
[15:08:32] https://netbox.wikimedia.org/dcim/interfaces/26134/
[15:09:27] topranks: is that something that can be documented someplace for future cloudvirts?
[15:09:58] yeah I guess we can. this isn't anything new to do with the new racks
[15:10:19] so previous cloudvirts also had that requirement
[15:10:41] let me change it and see does it help first anyway
[15:13:54] Yeah that sorted it alright
[15:13:59] root@buildvm-84424939-7022-4ea6-9e97-0ef7d3df1af0:~# ping 172.16.0.1
[15:13:59] PING 172.16.0.1 (172.16.0.1) 56(84) bytes of data.
[15:13:59] 64 bytes from 172.16.0.1: icmp_seq=1 ttl=64 time=0.407 ms
[15:14:18] I'll make the same change for the rest of them.
[15:16:51] thank you topranks! Let me know if I need to do anything for followup otherwise I'll just rebuild those canaries and see for myself :)
[15:19:54] np! that change has been made for all the new hosts now so hopefully they all should be good to go
[15:20:27] In terms of the process it should be documented for these hosts, the use of the "eno2np1.1105" sub-interface on them means the switch needs to know to expect tagged traffic.
[15:21:29] Longer term we do hope to improve our automation to get vlan interface data imported into puppet, which in theory would allow us to detect this situation on import and set the switch ports accordingly in netbox. Few bits we need to work out to get there though, so for now when the second interfaces are configured in Netbox they need to be manually set to tagged.
[15:26:28] somebody else is having issues spawning VMs on WMCS?
[15:27:13] I've been waiting almost 5h30m for one
[15:27:20] 😅
[15:27:30] We just resolved an issue which was blocking network access for VMs on a bunch of newer cloudvirt* hosts
[15:27:45] Unsure if that might be the cause, problem should now be resolved
[15:28:12] vgutierrez: which VM?
[15:28:27] and next time please tell us right away on -cloud please
[15:29:15] no problem.. went for lunch, to the hospital.. and I've realised now that the VM is still being created
[15:29:24] traffic-cache-atstext-buster under the traffic project
[15:29:55] looking, thanks
[15:32:33] vgutierrez: unrelated, but you should just delete and retry
[15:32:43] (I mean, unrelated to the thing topranks is working on)
[15:32:46] ack
[15:33:15] yeah, I suspect it was lost on the database issues we had earlier today
[15:33:34] +1 to deleting and re-creating, unless there's some way I don't know to tell openstack to reschedule
[15:33:45] topranks: ready for me to rebuild all those canaries or do you have more config to change?
[15:34:05] Nope all done on my end go ahead
[15:34:09] great, thanks
[15:43:52] topranks: everything is much better now. Thank you for the quick fix!
[15:44:26] np!
[16:17:34] I'm going to start the emergency eqiad linecard replacement, no impact expected
[16:17:55] https://phabricator.wikimedia.org/T312745
[17:09:48] all done with the linecard replacement
[18:14:44] \o/
[18:14:50] i was chasing which channel had this notification