[12:39:48] Hello! Can y'all point me to any docs on how (if?) we do traffic throttling (for APIs?)? I'm being asked by a manager to explain if there are any safeguards for people DoSing intake-analytics.wikimedia.org. This is the eventgate-analytics-external public endpoint.
[12:42:57] <_joe_> surely not on a publicly logged channel :D
[12:51:10] oh, k ;)
[13:10:57] hi, who is the puppet expert lately? I would like a second opinion on whether to write a task about a performance issue I am seeing
[13:15:03] jynus: could you add me to that task please?
[13:15:15] I have not created one yet :-D
[13:15:30] You know what I meant
[13:15:38] I will :-)
[13:15:46] thank you
[13:15:56] I have a suspicion that puppet runs kill my backup runs!
[13:16:19] although it could be not puppet itself, but what is set up in puppet to run
[13:17:28] FYI folks, I plan to start updating conftool for T365123 at around 14:00 UTC today (deferred from last week). this should be a low-risk release, but please let me know if you have questions / reservations.
[13:17:28] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[13:19:16] <_joe_> jynus: I guess I'd ask in the I/F channel
[13:19:38] <_joe_> although I've been told that elukey is the team interface for I/F now, but I wouldn't trust that info :D
[13:20:20] jynus: if puppet is killing your backup runs somehow it should be obvious in the puppet.log on the machine
[13:21:24] cdanis: it is not directly killed, it exhausts network/io
[13:21:40] I am retrying again with puppet disabled to test the correlation
[13:22:08] something is writing to disk at almost 3MB/s and creating a lot of network errors
[13:22:17] and correlates with puppet runs
[13:22:48] see persistence - the log says it is doing nothing at the time, but the time correlates
[13:23:38] (on 2 separate hosts)
[13:23:55] doing a run with puppet disabled to test the negative hypothesis before digging more
[13:26:22] my guess is there could be some extra load for certain types of hosts only (e.g. gathering RAID facts, or something affecting only a subset of hosts)
[13:26:44] the load is variable yes
[13:27:22] but it is weird that it started happening on 2 hosts at the same time, from different datacenters (but same hw model)
[13:27:33] which hosts
[13:27:44] backup[12]002
[13:28:27] https://grafana.wikimedia.org/goto/AOlwAU8Sg?orgId=1
[13:28:43] a small rate of tcp/attemptfails is nothing to worry about
[13:28:50] those network errors, while low, are enough to kill my backups
[13:29:02] I know, but they break my tcp connections :-D
[13:29:11] no, they are new TCP connections failing to be established
[13:29:21] probably because puppet is trying to connect to something not listening where it expects
[13:29:56] yeah, then probably unrelated, but the effect is real
[13:30:11] what kind of timeouts do you have set on your dumps?
[13:30:53] yeah, those have been happening for over a month
[13:31:03] so this is new behaviour
[13:31:21] I left everything as default, to be fair, as it used to work well
[13:32:53] what kind of timeouts do you have set on your dumps?
[13:33:20] can you tell me about the conditions under which the ... I guess it's Python? MySQL client issues "Lost connection to MySQL server during query"
[13:33:39] I would have to check the app defaults
[13:33:42] please
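(Editor's note: "Lost connection to MySQL server during query" on long-running dumps is most often bounded by server-side settings such as net_read_timeout, net_write_timeout, wait_timeout and max_allowed_packet, or by something killing the connection mid-transfer. Below is a minimal sketch of checking those defaults, assuming a reachable MariaDB/MySQL instance and the pymysql driver; the hostname and credentials are placeholders and this is not the actual backup tooling. Comparing these values against the times at which the dump connections drop would help confirm or rule out a plain timeout.)

```python
# Sketch only: print the MySQL/MariaDB settings that most commonly explain
# "Lost connection to MySQL server during query" on long dumps.
# Host/credentials are placeholders, not real production values.
import pymysql

VARIABLES = (
    "net_read_timeout",
    "net_write_timeout",
    "wait_timeout",
    "max_allowed_packet",
)

conn = pymysql.connect(host="db-backup-source.example", user="dump", password="secret")
try:
    with conn.cursor() as cur:
        for name in VARIABLES:
            # SHOW GLOBAL VARIABLES takes a LIKE pattern; query one name at a
            # time so the output stays explicit.
            cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
            row = cur.fetchone()
            if row:
                print(f"{row[0]} = {row[1]}")
finally:
    conn.close()
```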
[13:33:54] but I am not worried about the backups, I know I can make those work
[13:34:17] I am worried about the behaviour change - seemingly related to puppet runs
[13:34:26] nothing substantial has changed on the puppet side either, the tcp/attemptfails are a red herring, and the 3MB/s of disk writes is less than 5% of your own write bandwidth
[13:34:36] so I would like to understand the MySQL side better if you would humor me :)
[13:36:01] there is also no saturation reported on the NIC or the disk
[13:37:46] https://github.com/mydumper/mydumper/blob/v0.10.1/mydumper.c
[13:38:32] I don't see the error string in that file
[13:38:52] yeah, it must be the mysql driver
[13:39:38] can you perhaps get an strace around the time the error happens?
[13:39:40] Allow me to finish the backups
[13:39:42] ok
[13:40:05] which seem to now go smoothly with puppet disabled (coincidence?)
[13:40:15] and I will file a task with all details
[13:40:26] I agree there is probably something causing puppet (or increased IO load in general) to cause a hiccup in the backups
[13:40:37] I think it is on the client
[13:40:49] because it failed on both servers at the same time, for both clients
[13:41:11] AND that server is a bit special (the only one left with a disk array)
[13:41:33] so there could be something, not puppet's fault, but that makes it weird on that host
[13:41:42] an strace around the time of the backup failing would be really useful
[13:41:54] it would also be useful to know if it happens when you add some synthetic I/O load other than puppet
[13:41:55] that's file
[13:41:57] *fine
[13:42:08] I just wanted a second opinion to rule out that I was crazy
[13:42:26] and once I fix the more immediate issues (failures)
[13:42:47] I will provide more info and ok to add you to the ticket, cdanis too ?
[13:42:51] 👍
[13:42:59] thanks for the help
[13:44:25] it is easier to debug when not running on a production server (e.g. backing up a host I don't care about)
[13:46:43] and definitely the prometheus granularity is not enough, but I think the load looks higher than on other hosts
[13:47:11] jynus: yeah, indeed, but that's why we have nic_saturation_exporter, which runs at 1Hz
[14:07:46] starting conftool updates now. I'll be logging in -operations
[14:09:15] * volans around
[14:29:16] Rollout of 100% of external traffic to mw-on-k8s in progress, by the way :)
[14:30:04] gl and congrats!
[14:30:28] nice!
[14:31:39] ✨ 🥳 ✨
[14:41:56] <_joe_> we have probably some stragglers in the datacenters served by codfw
[14:42:01] <_joe_> where puppet didn't run before
[14:42:17] <_joe_> but overall traffic to physical servers seems to be down to almost zero besides monitoring
[14:42:32] Yeah, 45rps
[14:45:09] <_joe_> there's still some external traffic on a random appserver in codfw
[14:45:16] <_joe_> like 1 rps
[14:47:36] where's doc.wikimedia.org hosted?
[14:48:02] claime: doc* hosts
[14:48:20] 1003 and 2002
[14:48:20] ok
[14:50:04] congratulations to all the k8s wizards! an inspiring achievement :)
[14:50:49] _joe_: where are you seeing that? farmer's grep? I'm not seeing any traffic without an orchestrator.namespace in logstash
[14:51:15] <_joe_> claime: given phys hosts don't relay their apache logs to logstash
[14:51:20] ah yeah
[14:51:22] <_joe_> I wouldn't expect you to see it
[14:51:23] that'd explain it
[14:51:27] :p
[14:51:30] (I keep forgetting)
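(Editor's note: a minimal sketch of the kind of logstash check described above — counting recent events that have no orchestrator.namespace field, i.e. requests not served from a Kubernetes pod. The search endpoint and index pattern are placeholder assumptions, and as noted just above, physical appservers don't ship their apache logs to logstash, so an empty result here does not by itself prove there is zero traffic to them.)

```python
# Sketch only: count recent events lacking an orchestrator.namespace field.
# The endpoint and index pattern below are placeholders, not the real cluster.
import requests

SEARCH_URL = "https://logstash.example.org/logstash-*/_search"

query = {
    "size": 0,  # we only need the hit count, not the documents
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ],
            "must_not": [
                {"exists": {"field": "orchestrator.namespace"}},
            ],
        }
    },
}

resp = requests.post(SEARCH_URL, json=query, timeout=30)
resp.raise_for_status()
# On recent Elasticsearch/OpenSearch, hits.total is an object like
# {"value": 123, "relation": "eq"}.
print("events without orchestrator.namespace in the last 15m:",
      resp.json()["hits"]["total"])
```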
[14:52:34] <_joe_> and now, we're down to just monitoring
[14:52:58] <_joe_> claime: I don't see many stragglers, at least on the appservers
[14:53:02] <_joe_> not sure about api though
[14:53:46] <_joe_> but yeah, all the live traffic is now de facto on k8s
[14:55:46] <_joe_> can't believe it
[14:55:48] <_joe_> :)
[14:58:29] I see no traffic that is not Twisted, icinga, server-status, BlankPage or httpbb after 1448
[15:06:41] <_joe_> yep :)
[15:12:20] claime: congrats!!!
[15:12:36] I did a brain fart and forgot to have a change reviewed for some new hosts, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042227 if someone would be so kind as to give it a once-over, that'd be nice.
[15:12:37] I didn't think we would ever get to 100% tbh
[15:12:45] cdanis: thanks! _joe_ did the hard work though ;)
[15:12:54] I just did the Human Pod Autoscaler
[15:12:57] (and yes, I know everyone is kinda busy)
[15:13:53] <_joe_> cdanis: me neither :)
[15:17:12] https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=5m&viewPanel=4 --> this one needs to be updated?
[15:17:46] oh yeah
[15:18:02] let me check
[15:19:38] ah yeah we don't have that exact metric for k8s
[15:24:30] I don't even know what's emitting it
[15:25:17] <_joe_> we do
[15:25:25] <_joe_> via the benthos stuff
[15:25:48] <_joe_> but I'd just use the mean latency from the mesh like you did on the RED dashboard for now
[15:30:24] sum(rate(mediawiki_http_requests_duration_sum{deployment=~"mw-api-ext|mw-web", handler='php'}[5m])) / sum(rate(mediawiki_http_requests_duration_count{deployment=~"mw-api-ext|mw-web", handler='php'}[5m])) < looks ok ?
[15:30:35] it reports a lot higher latency than the former metric
[15:31:56] <_joe_> claime: yeah I'm not sure it's computed correctly
[15:33:05] wow congrats folks :)
[15:35:37] <_joe_> claime: so we definitely need to fix that pipeline, but in the meantime I'd use the envoy metrics
[15:36:05] ok
[15:36:56] <_joe_> we need to fix it on statuspage too :)
[15:37:16] yes
[15:37:24] <_joe_> but yeah, basically the 50 ms it was reporting now didn't make much sense either
[15:42:41] https://grafana.wikimedia.org/goto/n0PT8U8Sg?orgId=1
[15:48:42] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047115
[15:50:34] <_joe_> claime: +1'd
[15:50:41] ty
[16:19:26] FYI, conftool updates are taking a bit longer than expected after some issues discovered while importing the packages to apt.w.o. That's now resolved and I'm moving ahead with updates / tests.
[19:25:15] FYI, conftool updates finished around 18:45 UTC. Other than the packaging-related hiccough at the beginning, no further issues encountered.
[19:27:10] thanks so much swfrench-wmf :)
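(Editor's note on the latency-panel discussion above: a minimal sketch of evaluating the PromQL expression proposed at 15:30:24 via the standard Prometheus HTTP API, e.g. to compare its output against the envoy/mesh latency numbers suggested as an interim replacement. The API base URL is a placeholder assumption; the expression itself is quoted verbatim from the conversation.)

```python
# Sketch only: evaluate the mean-latency expression from 15:30:24 through the
# Prometheus HTTP API so it can be compared with other latency sources.
# The API base URL is a placeholder, not the real endpoint.
import requests

PROM_API = "https://prometheus.example.org/api/v1/query"

EXPR = (
    "sum(rate(mediawiki_http_requests_duration_sum"
    "{deployment=~\"mw-api-ext|mw-web\", handler='php'}[5m]))"
    " / "
    "sum(rate(mediawiki_http_requests_duration_count"
    "{deployment=~\"mw-api-ext|mw-web\", handler='php'}[5m]))"
)

resp = requests.get(PROM_API, params={"query": EXPR}, timeout=30)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    # An instant vector: each element carries a [timestamp, value-as-string] pair.
    print("mean request duration (s):", result[0]["value"][1])
else:
    print("no data returned for the expression")
```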