[07:47:17] Argh, ms-be2057 won't PXE boot. It gets to the relevant point then complains about media failure (I think; the message goes past very quickly)
[07:49:04] Is it possible to put it back into puppetdb at this point? The reimage cookbook takes it out
[07:49:26] it should be
[07:49:40] it happened to me
[07:50:19] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation I guess
[07:52:30] I presume the underlying issue is that this host has two Broadcom 10G ports, and it's trying to PXE off the wrong one.
[07:55:06] sounds like https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting
[07:56:40] moritzm: might be (and worth a look), but I did see an error message about not finding media (or something like that), which made me think it was trying the wrong NIC
[07:57:11] yeah, might very well also be that
[07:58:38] Probably worth a poke round the BIOS menu to see if something looks obvious there
[08:50:08] web-IPMI gave me a "just boot off the mezzanine card next time" option that seems to have got us into the installer at least
[09:54:30] let's try the next one of this sort and see if it's better behaved
[10:18:11] Hm, the Dell-specific docs are a bit out of date; "racadm config -g cfgServerInfo -o cfgServerFirstBootDevice" gets a deprecation warning on earlier BIOSes, and on newer ones just refuses to work at all
[14:13:47] Is it possible to get metrics for "read from memory percentage" per table, rather than just by host?
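[Editor's note: on newer iDRAC generations the deprecated `cfgServerInfo` group is replaced by racadm's attribute-style `set` syntax. A rough sketch follows — the attribute names are recalled from the Dell RACADM reference, so treat them as assumptions and confirm with `racadm help` against the firmware generation in question:]

```
# Deprecated form (what the Dell-specific wiki page still documents):
#   racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE

# Assumed newer attribute-style equivalent (verify on the target iDRAC):
racadm set iDRAC.ServerBoot.FirstBootDevice PXE
racadm set iDRAC.ServerBoot.BootOnce 1   # one-shot; reverts to the normal boot order afterwards
```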
[14:16:17] addshore: my first suggestion would be io_global_by_file_by_latency
[14:17:14] Another question, I'm looking at https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics?orgId=1&forceLogin=&from=1649700879360&to=1651449599000&viewPanel=15 and trying to determine if there is anything special about db1171
[14:17:36] ie, this host has the lowest read from memory percentage for s8 it seems (99.4%)
[14:18:02] addshore: it's a backup source
[14:18:13] epic, where can I check that? :)
[14:18:42] https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/db1171.yaml
[14:18:53] "dbstore" in _this_ context means "backup source"
[14:18:58] so that is only a backup source?
[14:19:05] ie no user traffic
[14:19:16] correct
[14:19:21] then db1154 is the host that sits in front of the cloud dbs
[14:19:24] addshore: I think you should focus on mw config
[14:19:44] addshore: yes, db1154 is the sanitarium
[14:19:49] ah, I was assuming you only wanted mw performance
[14:19:51] also no user queries
[14:20:22] So we have this dashboard that some folks look at https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics?orgId=1&from=now-30d&to=now&viewPanel=25
[14:20:46] and I was just reading a doc that said we might want to worry about the fact that some of these numbers are not at 100%
[14:21:03] personally I'm not worried, and it also turns out the ones getting organic user traffic are at 99.9% anyway
[14:21:36] addshore: db1167 is the sanitarium master, and also vslow/dump, so it'll get very different queries than the normal replicas.
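[Editor's note: the "read from memory percentage" being discussed is essentially an InnoDB buffer pool hit ratio. A minimal sketch of how such a figure is derived from the server's standard global status counters (`Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads`); the exact expression the Grafana panel uses is an assumption here:]

```python
def buffer_pool_hit_pct(read_requests: int, disk_reads: int) -> float:
    """Percentage of logical page reads served from the buffer pool (memory)
    rather than from disk, using InnoDB's global status counters.

    read_requests: Innodb_buffer_pool_read_requests (all logical reads)
    disk_reads:    Innodb_buffer_pool_reads (reads that had to hit disk)
    """
    if read_requests == 0:
        return 100.0  # no reads at all: treat as fully served from memory
    return 100.0 * (1 - disk_reads / read_requests)

# Illustrative numbers in the ballpark of the 99.4% seen for db1171
# (hypothetical counter values, not real measurements):
print(round(buffer_pool_hit_pct(1_000_000, 6_000), 1))
```

A host that serves cold, scan-heavy workloads (backups, dumps, vslow) naturally shows a lower ratio than replicas serving hot MediaWiki traffic, which is the point being made above.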
[14:21:44] lovely :)
[14:22:16] orchestrator looks lovely btw :)
[14:22:20] and finally db1116 is the other backup source for s8
[14:22:29] addshore: it's such a nice improvement over tendril :)
[14:22:43] addshore: backup sources should be able to be filtered on grafana
[14:22:49] they use a different job
[14:23:09] ooh, nice, I'll add that as a note in this document I am commenting on
[14:23:33] for mw, I think the job is called core or something like that
[14:23:44] that way you don't get dbstore or test hosts
[14:24:43] addshore: job="mysql-dbstore" for the backup sources, "mysql-core" for the others in that graph
[14:24:59] thanks!
[14:25:15] another thing you could do is skip hosts if the number of queries is too low, meaning the host is under maintenance
[14:26:08] sanitarium I think also has a separate job
[14:29:10] https://usercontent.irccloud-cdn.com/file/ssoX2qX4/image.png
[14:29:54] then it is probably misconfigured on our side (zarcillo)
[14:30:37] it was expected to be on the labs/cloud job, leaving core for proper mw hosts
[14:30:53] ah, it is the master
[14:31:04] sorry, I thought it was the sanitarium itself
[14:31:10] then it is ok
[14:32:13] db1154 is the one that should be on the labs job
[14:33:48] as that would impact the app server performance graph
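[Editor's note: the job-label filtering described above can be sketched as PromQL label matchers. The job labels are the ones quoted in the conversation; the metric name is a placeholder, since the actual metric behind the panel isn't stated here:]

```
# Restrict the panel to hosts serving real MediaWiki traffic:
some_mysql_metric{job="mysql-core"}

# Backup sources only:
some_mysql_metric{job="mysql-dbstore"}

# Everything except backup sources (negative matcher):
some_mysql_metric{job!="mysql-dbstore"}
```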