[07:47:17] Argh, ms-be2057 won't PXE boot. It gets to the relevant point then complains about media failure (I think; the message goes past very quickly)
[07:49:04] Is it possible to put it back into puppetdb at this point? The reimage cookbook takes it out
[07:49:26] it should be
[07:49:40] it happened to me
[07:50:19] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation I guess
[07:52:30] I presume the underlying issue is that this host has two Broadcom 10G ports, and it's trying to PXE off the wrong one.
[07:55:06] sounds like https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting
[07:56:40] moritzm: might be (and worth a look), but I did see an error message about not finding media (or something like that), which made me think it was trying the wrong NIC
[07:57:11] yeah, might very well also be that
[07:58:38] Probably worth a poke round the BIOS menu to see if something looks obvious there
[08:50:08] web-IPMI gave me a "just boot off the mezzanine card next time" option that seems to have got us into the installer at least
[09:54:30] let's try the next one of this sort and see if it's better behaved
[10:18:11] Hm, the Dell-specific docs are a bit out of date; "racadm config -g cfgServerInfo -o cfgServerFirstBootDevice" gets a deprecation warning on earlier BIOSes, and on newer ones just refuses to work at all
[14:13:47] Is it possible to get metrics for "read from memory percentage" per table, rather than just by host?
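[Editor's note: on newer iDRAC generations the deprecated `cfgServerInfo` group is replaced by racadm's attribute-style `set` syntax. A rough sketch follows — the attribute names are recalled from the Dell RACADM reference, so treat them as assumptions and confirm with `racadm help` against the firmware generation in question:]

```
# Deprecated form (what the Dell-specific wiki page still documents):
#   racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE

# Assumed newer attribute-style equivalent (verify on the target iDRAC):
racadm set iDRAC.ServerBoot.FirstBootDevice PXE
racadm set iDRAC.ServerBoot.BootOnce 1   # one-shot; reverts to the normal boot order afterwards
```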
[14:16:17] addshore: my first suggestion would be io_global_by_file_by_latency
[14:17:14] Another question, I'm looking at https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics?orgId=1&forceLogin=&from=1649700879360&to=1651449599000&viewPanel=15 and trying to determine if there is anything special about db1171
[14:17:36] ie, this host has the lowest read from memory percentage for s8 it seems (99.4%)
[14:18:02] addshore: it's a backup source
[14:18:13] epic, where can I check that? :)
[14:18:42] https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/db1171.yaml
[14:18:53] "dbstore" in _this_ context means "backup source"
[14:18:58] so that is only a backup source?
[14:19:05] ie no user traffic
[14:19:16] correct
[14:19:21] then db1154 is the host that sits in front of the cloud dbs
[14:19:24] addshore: I think you should focus on mw config
[14:19:44] addshore: yes, db1154 is the sanitarium
[14:19:49] ah, I was assuming you only wanted mw performance
[14:19:51] also no user queries
[14:20:22] So we have this dashboard that some folks look at https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics?orgId=1&from=now-30d&to=now&viewPanel=25
[14:20:46] and I was just reading a doc that said we might want to worry about the fact that some of these numbers are not at 100%
[14:21:03] personally I'm not worried, and it also turns out the ones getting organic user traffic are at 99.9% anyway
[14:21:36] addshore: db1167 is the sanitarium master, and also vslow/dump, so it'll get very different queries than the normal replicas.
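[Editor's note: the "read from memory percentage" being discussed is essentially an InnoDB buffer pool hit ratio. A minimal sketch of how such a figure is derived from the server's standard global status counters (`Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads`); the exact expression the Grafana panel uses is an assumption here:]

```python
def buffer_pool_hit_pct(read_requests: int, disk_reads: int) -> float:
    """Percentage of logical page reads served from the buffer pool (memory)
    rather than from disk, using InnoDB's global status counters.

    read_requests: Innodb_buffer_pool_read_requests (all logical reads)
    disk_reads:    Innodb_buffer_pool_reads (reads that had to hit disk)
    """
    if read_requests == 0:
        return 100.0  # no reads at all: treat as fully served from memory
    return 100.0 * (1 - disk_reads / read_requests)

# Illustrative numbers in the ballpark of the 99.4% seen for db1171
# (hypothetical counter values, not real measurements):
print(round(buffer_pool_hit_pct(1_000_000, 6_000), 1))
```

A host that serves cold, scan-heavy workloads (backups, dumps, vslow) naturally shows a lower ratio than replicas serving hot MediaWiki traffic, which is the point being made above.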
[14:21:44] lovely :)
[14:22:16] orchestrator looks lovely btw :)
[14:22:20] and finally db1116 is the other backup source for s8
[14:22:29] addshore: it's such a nice improvement over tendril :)
[14:22:43] addshore: backup sources should be able to be filtered on grafana
[14:22:49] they use a different job
[14:23:09] ooh, nice, I'll add that as a note in this document I am commenting on
[14:23:33] for mw, I think the job is called core or something like that
[14:23:44] that way you don't get dbstore or test hosts
[14:24:43] addshore: job="mysql-dbstore" for the backup sources, "mysql-core" for the others in that graph
[14:24:59] thanks!
[14:25:15] another thing you could do is skip hosts if the number of queries is too low, meaning the host is under maintenance
[14:26:08] sanitarium I think also has a separate job
[14:29:10] https://usercontent.irccloud-cdn.com/file/ssoX2qX4/image.png
[14:29:54] then it is probably misconfigured on our side (zarcillo)
[14:30:37] it was expected to be on the labs/cloud job, leaving core for proper mw hosts
[14:30:53] ah, it is the master
[14:31:04] sorry, I thought it was the sanitarium itself
[14:31:10] then it is ok
[14:32:13] db1154 is the one that should be on the labs job
[14:33:48] as that would impact the app server performance graph
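[Editor's note: the job-label filtering described above can be sketched as PromQL label matchers. The job labels are the ones quoted in the conversation; the metric name is a placeholder, since the actual metric behind the panel isn't stated here:]

```
# Restrict the panel to hosts serving real MediaWiki traffic:
some_mysql_metric{job="mysql-core"}

# Backup sources only:
some_mysql_metric{job="mysql-dbstore"}

# Everything except backup sources (negative matcher):
some_mysql_metric{job!="mysql-dbstore"}
```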