[06:58:45] greetings
[07:01:47] morning
[07:15:07] 👋
[08:21:10] o/
[08:23:45] can someone review this mail about bastion host key changes before I send it out? https://etherpad.wikimedia.org/p/ssh-key-change-trixie
[08:26:32] taavi: LGTM. nit: I would add a link to a phab task, so if anyone has an issue they can report it there
[08:27:40] dhinus: done
[08:28:19] thanks!
[12:46:58] opinions on this iops quota request? T404254
[12:46:58] T404254: Increase iops for recommendation-api project - https://phabricator.wikimedia.org/T404254
[12:56:41] i was leaving that for the ceph experts, I have too little info about the effects of those limits to have an opinion
[13:01:11] same tbh, also I don't know what the current quotas are in terms of read bandwidth or iops; anyway 500MB/s is almost half of 10Gbit
[13:04:58] I didn't say what my point was: it seems a little too much to me for a single VM to be able to saturate ~half the bandwidth
[13:05:31] alternatively local storage could be an option (?)
[13:06:52] we have a higher IOPS limit for toolsdb vms and it doesn't seem to cause trouble
[13:07:18] I doubt they will ever reach 500MB/s though, and IIRC we only limit based on IOPS, not on MB/s
[13:33:36] Unless I'm misplacing a decimal, the default throttle is 200 and they're asking for an increase to 500, so a ~2.5x increase
[13:33:51] whereas the toolforge high-iops nodes have 10x
[13:34:23] so it seems like a modest request
[13:35:14] btw godog you can see the throttles with 'openstack flavor show' and a flavor ID. It's another reason why we don't let users manage their own flavors
[13:35:37] andrewbogott: ah! thank you, TIL
[13:39:17] hm, well, that's for VM local drives and I guess that ticket is about cinder...
[13:41:13] for cinder you can do things like 'openstack volume qos show fc8ba52e-c079-4f56-b342-6f359366b08e'
[13:45:41] godog: regarding local storage: standard hypervisors don't have drive space for local storage, but we have a cluster of three small HVs (cloudvirtlocal100x) which host VMs locally. At the moment we only use them for toolforge/toolsbeta etcd, but there's space on there for a few more things.
[13:46:17] But of course that doesn't help with cinder volumes since they're networked regardless; only with the / drive of the VM.
[13:46:37] and of course local storage is a pain for users since they just have to eat any downtime that the hosting HV suffers.
[13:46:57] I'm not suggesting anything here, just trying to address your (?)
[13:47:07] hah good point, ok so local storage not really feasible
[13:47:11] thank you for the explanation
[13:48:31] It's feasible, but we'd have to really trust the user to understand the downsides. And it would be a bit more work for us maintenance-wise.
[13:49:25] so far etcd is the only workload we've seen that really really cares about latency for tiny reads and writes; that's the main use case for local storage.
[13:49:37] * godog nods
[14:00:43] I went back to read my experiments with toolsdb volumes where I ran some performance benchmarks https://phabricator.wikimedia.org/T301949#8326055
[14:00:59] we also have https://cloud-ceph-performance-tests.toolforge.org/
[14:01:50] andrewbogott: they're asking for 500MB/s but I think our limit is on IOPS, not MB/s
[14:02:49] dhinus: we're inconsistent. There's an MB/s throttle on the VM flavor and on the standard ceph flavor
[14:03:03] uh... sorry, 'on the standard cinder type'
[14:03:29] here's what I mean: first is the standard cinder type, second is the high-bandwidth type:
[14:03:33] https://www.irccloud.com/pastebin/75t1Q6bI/
[14:04:03] total_bytes_sec='200000000' corresponds to the limit they're seeing, so it seems likely that's what they're hitting
[14:04:22] makes sense
[14:05:13] so we could make them a cinder type with 3x the metrics of 'standard' and see how it goes.
[14:05:20] sgtm
[14:05:22] Or we could tell them to be patient and live with the slow system :)
[14:05:48] since it's test/dev I'm not super compelled unless we're talking about multi-minute queries
[14:13:05] looking at grafana I don't see them actually hitting the limits
[14:13:09] I will ask for more details
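
For reference, a rough sketch of how the throttles discussed above could be inspected, and how a "3x standard" cinder type might be created, with the stock OpenStack CLI. The flavor name and the standard-3x names are placeholders, and the only concrete number is 3x the total_bytes_sec value quoted from the standard cinder type; the real extra-spec keys and values depend on how the deployment defines its flavors and QoS specs.

    #!/usr/bin/env bash
    # Sketch only: placeholder flavor/type names; adjust to the real deployment.

    # Per-VM disk throttles live on the Nova flavor as extra specs
    # (e.g. quota:disk_total_bytes_sec, quota:disk_total_iops_sec).
    openstack flavor show g3.cores8.ram16.disk20 -c properties

    # Cinder volumes are throttled via the QoS spec associated with their
    # volume type; 'show' prints keys like total_bytes_sec / total_iops_sec.
    openstack volume qos list
    openstack volume qos show fc8ba52e-c079-4f56-b342-6f359366b08e

    # One possible "3x standard" offering: a new QoS spec + volume type,
    # scaling the quoted standard limit (200000000 bytes/s) by 3.
    openstack volume qos create standard-3x-qos --property total_bytes_sec=600000000
    openstack volume type create standard-3x
    openstack volume qos associate standard-3x-qos standard-3x

The IOPS keys would be scaled the same way, but the standard values for those aren't quoted in the log, so they're left out of the sketch.
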
[14:29:28] in other news, we have a reply from europeana on T404347
[14:29:29] T404347: WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347
[14:30:06] it looks like it's not the tool that was suggested by other commenters (the query params are different)
[14:33:13] they sent some timestamps, taavi do you think we could match those to NAT logs? where are those logs?
[14:33:46] * dhinus finds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network#routing_source_ip
[14:40:31] maybe I'm doing something wrong, but api.europeana.eu resolves to two Cloudflare IP addresses and I'm not finding those IPs in any of our NAT logs
[14:42:12] dhinus: note that the format the NAT logs use for IP addresses is a bit weird, it requires leading zeros in each octet etc.
[14:42:36] ah yes, thanks!
[14:42:42] that's what I was missing :)
[14:43:44] btw if you have thoughts on how and where this process should be documented, I want to hear those
[14:52:30] I think the wikitech page above is quite clear, and I could find the logs easily
[14:52:51] maybe we could add a note on how to search for things, for example I came up with this:
[14:53:04] zgrep "172.066.137.231" natlog.log.?.gz |awk '{print $11}' |cut -d',' -f1 |sort |uniq -c
[14:53:29] that finds 172.016.005.011, which is hupu2.wikidocumentaries.eqiad1.wikimedia.cloud
[14:54:11] I'm pinging the maintainers of that project in the task
[14:59:03] that matches https://github.com/Wikidocumentaries/wikidocumentaries-api/blob/684ddde358d4356f700c55a65a43076f3a66cec8/europeana.js#L10
[15:00:51] it also matches tcpdump output from that host
[15:00:59] I think we found it :)
[15:01:14] in the end the IP alone was enough
[15:04:29] I have to log off, but I'll be back later if you need me
[15:04:41] otherwise have a good weekend :)
[15:04:47] * dhinus off
[19:41:57] dcaro: https://phabricator.wikimedia.org/T404471#11177519 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187892
[19:45:48] taavi: +1d
[19:49:54] applied, geohack is getting throttled as expected
[19:50:13] nice, thanks!
[19:53:04] [this can definitely wait until next week:] the only problem is that the error page doesn't show the logo; https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187894 fixes that
[19:54:26] +1d, I'll close the task, but added it to next week's meeting; feel free to add subtasks/followups if you want, but I'll wait to think about it until next week xd
[19:54:33] taavi: thanks for the fix!
[19:55:11] yeah I think I should write a short docs snippet explaining the error to tool maintainers, but that's definitely a next week problem
[19:55:17] likewise thanks for being around!
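
On the NAT log search from around 14:53 above, a small sketch of the same lookup with the octet zero-padding handled by a helper, in case it is useful for the docs note dhinus mentioned. The log path, the awk field ($11), and the example Cloudflare IP are taken from the one-liner in the log; everything else is an assumption about the environment.

    #!/usr/bin/env bash
    # Sketch: find which internal host talked to a given external IP via the
    # NAT logs. Field positions and log path match the one-liner above.

    pad_ip() {
      # 172.66.137.231 -> 172.066.137.231 (the NAT logs zero-pad each octet)
      printf '%03d.%03d.%03d.%03d\n' $(echo "$1" | tr '.' ' ')
    }

    dest=$(pad_ip "172.66.137.231")   # external destination to look for

    # Count how often each internal source IP shows up for that destination.
    zgrep "$dest" natlog.log.?.gz | awk '{print $11}' | cut -d',' -f1 | sort | uniq -c | sort -rn
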