[00:45:49] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10Papaul) 05Open→03Resolved mgmt DNS for k8s2029 and 2030 fixed. @akosiaris all yours, the last node will be tracked at @T345650
[08:03:51] hello folks!
[08:04:16] the WME team will stop hitting ORES soon (today) and during the next days they will start hitting Lift Wing instead
[08:04:45] we don't have all the redis caching workflow on LW, so the WME's traffic will translate into fetching data from the MW API
[08:05:15] I'd expect it to be around 50 rps max, it seems ok compared to the volume of GET traffic that we handle (~5k rps from the RED dashboard)
[08:05:18] but lemme know otherwise
[08:13:14] if it is indeed 50 rps, it's negligible
[08:13:23] but thanks for the heads up
[08:18:24] ack thanks!
[08:19:04] the api gateway will cap them at a maximum of 250k rph, so I'd expect 50/60 rps max
[08:19:14] (they know the limit etc..)
[09:39:05] 10serviceops: Setup kubernetes20[25-53] - https://phabricator.wikimedia.org/T345709 (10JMeybohm)
[09:39:44] 10serviceops: Setup kubernetes20[25-53] - https://phabricator.wikimedia.org/T345709 (10JMeybohm) p:05Triage→03Medium
[09:57:54] 10serviceops, 10Prod-Kubernetes: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507 (10JMeybohm)
[12:08:02] hey folks I was reviewing some slo dashboards
[12:08:21] and I noticed https://grafana.wikimedia.org/d/slo-Etcd/etcd-slo-s?orgId=1
[12:08:36] etcd main codfw seems to have exhausted all its error budget
[12:09:20] and from https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=etcd&var-error_rate=0.001&var-slo_latency_threshold=0.032&from=now-90d&to=now&viewPanel=22 it seems something happened on 18/7 to increase latencies
[12:14:29] nothing really broken atm
[12:14:35] just increased latencies
[12:17:23] hum...interesting
[12:38:56] what on earth...
[12:39:37] at some point we may need to add alarms when the error budget gets burned
[12:40:55] yup. rzl ^
[12:41:48] sal has an lvs1013 provisioning, but nothing else that seems related
[12:42:49] wrong timeframe actually
[12:43:13] but still, just an ipoid deploy in the relevant timeframe (12:23)
[12:43:46] oh for the love of, that dashboard isn't in UTC
[12:45:22] remove asw-b1-codfw from asw-b-codfw VC - T342076
[12:45:34] ok, that one looks more promising finally
[12:51:10] conf200[56] QGET latency increased
[12:51:48] ah no, wait, all QGET latencies increased + conf2005 PUT latency
[12:51:54] which kinda hints at conf2005 being the problem
[12:54:14] yup, TCP retransmits quadrupled https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf2005&var-datasource=thanos&var-cluster=etcd&from=1689663496852&to=1689684993746&viewPanel=31
[12:55:00] it's actually way worse for conf2004
[12:55:23] ah interesting!
[12:56:45] conf2004 is worse and actually IS on asw-b-codfw so maybe I'll focus on that one first
[12:57:55] I don't see anything that stands out from librenms for 2005 https://librenms.wikimedia.org/device/device=97/tab=port/port=8902/
[12:59:16] 10serviceops, 10Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10SCherukuwada) Hello! Seddon and I are meeting on Friday. We'll have a concrete action plan (or the beginnings of one) to share on Monday.
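
A minimal sketch of how the retransmit increase could be double-checked directly on a conf host, alongside the Grafana host-overview panel linked above; the interface and these particular commands are assumptions for illustration, not taken from the log:

    # Kernel-wide TCP retransmit counters; node_exporter publishes the equivalent
    # metrics that the host-overview panel plots. Run twice a few seconds apart to
    # get a feel for the rate rather than the absolute count.
    nstat -az TcpRetransSegs TcpExtTCPSynRetrans

    # Per-connection view: established sockets that have recorded retransmissions.
    sudo ss -tin state established | grep -c 'retrans:'
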
[13:04:11] elukey: however, conf2004 starts dropping packets at this exact date: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf2004&var-datasource=thanos&var-cluster=etcd&from=1688986344691&to=1690362145907&viewPanel=11
[13:04:47] confusingly, conf2006 is anyway dropping packets long before that?
[13:09:32] so, those drops are all for node_network_receive_drop_total
[13:09:52] so the packets are leaving the switch just fine, the node is dropping them? possibly conntrack?
[13:10:17] ethtool on 2005 says: rx_discards: 24327
[13:10:26] and mbuf_lwm_thresh_hit: 24327
[13:13:57] lol, the second hit on google for mbuf_lwm_thresh_hit is https://phabricator.wikimedia.org/T191996
[13:14:05] ahahah yes I was reading it
[13:15:00] but your theory about conntrack or similar is sound, it seems an issue with the nodes not being able to cope with traffic
[13:15:18] niah, I forgot to tell you I disproved it
[13:15:25] conntrack usage per grafana is at 5%
[13:15:48] elukey@conf2005:~$ sudo netstat -tunap | wc -l
[13:15:48] 9326
[13:16:17] conf2004 has 7361, but it does indeed have tg3 driver cards
[13:16:28] elukey@conf2005:~$ sudo netstat -tunap | grep TIME_WAIT | wc -l
[13:16:29] 8964
[13:16:35] Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe
[13:17:18] the uptime is also nice
[13:17:22] :D
[13:18:19] I'd propose a reboot to pick up a new kernel etc.. but between etcdmirror and pybal it may be a long one
[13:18:30] with the new equinox-based switchover scheme we'll at least be able to reboot conf* servers annually
[13:19:10] at the least
[13:19:46] mw1349 and -51 went down a bit ago, not finding anything in SAL to indicate why. they're both pooled=no in conftool, but I guess I should set those to pooled=inactive and file a task?
[13:19:48] shall I open a task with an initial explanation?
[13:19:53] oh, dammit, those are buster hosts
[13:20:22] might be claime rebooting them, maybe they didn't come back up?
[13:20:40] taavi: there is a rolling reboot of appservers for security reasons happening today
[13:24:03] akosiaris: one thing worth noting - almost all the TIME_WAIT tcp conns are sockets between nginx and etcd, I am wondering if we'd benefit from enabling tcp_tw_reuse
[13:24:29] the local etcd?
[13:24:41] so, etcd is not replying even to the local nginx?
[13:25:14] it is local yes, I think it is due to the auth proxy
[13:25:34] so everything works, but nginx opens a new tcp socket every time, and it lingers for the time_wait timeout
[13:27:16] not sure how things would improve, but it should be safe to enable (at least re-reading https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux)
[13:27:59] opening a task
[13:37:18] 10serviceops, 10Infrastructure-Foundations, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10elukey)
[13:37:21] done --^
[13:38:09] !log sudo ethtool -G eno1 rx 1000 on conf2004 T345738
[13:38:24] wrong channel I guess
[13:38:56] well, timewait sockets dropped a lot
[13:39:01] 10serviceops, 10Infrastructure-Foundations, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10elukey) Surely not related but I noticed that the conf2xxx nodes hold a ton (8/9k) sockets in TIME_WAIT, most of them related to nginx -> etcd local traffic....
[13:39:21] akosiaris: so you increased the rx buffer right?
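
For completeness, the same TIME_WAIT picture can be pulled with ss, which is lighter than netstat on a host holding thousands of sockets; a sketch, with the per-peer breakdown being an assumption about how one could confirm the nginx -> local etcd hypothesis rather than something run in the log:

    # Total TIME_WAIT sockets, roughly equivalent to the netstat pipeline above
    sudo ss -tan state time-wait | wc -l

    # Group them by peer address:port; if the nginx -> local etcd theory holds,
    # the local etcd endpoint should dominate the list
    sudo ss -tan state time-wait | awk 'NR>1 {print $4}' | sort | uniq -c | sort -rn | head
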
[13:39:28] yup
[13:39:40] tcp errors were also halved
[13:39:46] ehm sorry
[13:39:48] TCP retransmits
[13:40:10] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf2004&var-datasource=thanos&var-cluster=etcd&from=now-3h&to=now&viewPanel=13 is also interesting
[13:40:44] time wait sockets dropped to < 1000
[13:40:56] the drop at the end is probably because the rx buffer change re-initialized the network stack
[13:41:17] in dmesg you can see link is down, link is up for eno1
[13:41:36] so, just a result of the "reset", not necessarily a sign of solved issues
[13:42:03] akosiaris: this looks good though:
[13:42:04] https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=etcd&var-error_rate=0.001&var-slo_latency_threshold=0.032&from=now-3h&to=now&viewPanel=23
[13:42:41] ah, it subsided? that's good, cause I saw the spike a few mins ago and I was wondered whether I botched everything
[13:42:50] wondering*
[13:43:40] this one has no difference though: https://grafana-rw.wikimedia.org/d/slo-Etcd/etcd-slo-s?from=now-1h&orgId=1&to=now&var-datasource=thanos&var-site=codfw&var-cluster=etcd&viewPanel=4
[13:44:08] I like what you pointed out about https://sal.toolforge.org/log/TTBQaIkBhuQtenzvyrZs, it matches really well with the issue
[13:44:21] I pinged Arzhel on #sre, maybe he can give us some input
[13:44:41] yeah, but supposedly that switch was removed from the stack without having anything on it
[13:45:10] 👀
[13:45:16] o/
[13:45:38] XioNoX: TL;DR is https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf2004&var-datasource=thanos&var-cluster=etcd&from=1689658661804&to=1689729376575&forceLogin&viewPanel=31
[13:45:52] 10serviceops: Sunset onhost memcached on mediawiki servers and puppet - https://phabricator.wikimedia.org/T345740 (10jijiki) p:05Triage→03Low
[13:46:13] we believe that the TCP retransmit increase coincides with a) https://sal.toolforge.org/log/TTBQaIkBhuQtenzvyrZs and b) https://grafana-rw.wikimedia.org/d/slo-Etcd/etcd-slo-s?from=now-90d&orgId=1&to=now&var-datasource=thanos&var-site=codfw&var-cluster=etcd&viewPanel=4
[13:46:40] the most we've done up to now is increase the NIC rx buffer, probably to no avail
[13:47:04] nothing is broken to be fair, it is just the etcd request latencies that increased
[13:47:08] burning the error budget
[13:47:17] well, the error budget is badly burned
[13:47:28] also proving that the target was wrong
[13:47:41] nobody cared about the lower latency, otherwise we would have noticed
[13:48:05] but that's expected, this was our first SLO and we were probably way too aspirational
[13:48:15] so, as Luca says, no harm done and no urgency
[13:48:24] but it would be nice to know what on earth happened
[13:48:42] could be true yes, but at the same time something clearly happened, and there is a miss in our SLO procedures right now (namely we don't check the error budget except at the end of the Q :)
[13:48:49] exactly yes
[13:48:53] (slow in typing)
[13:49:00] (looking)
[13:49:21] conf2004 ports are in https://librenms.wikimedia.org/device/device=96/tab=port/port=9232/ btw
[13:49:26] I don't think asw-b1-codfw can be related as it was an unused leaf switch
[13:49:50] I checked the VC link errors and they're all at 0
[13:50:43] was there anything that may have triggered some issues when the switch was removed? Like a restart of the switches' control plane etc..
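
The effect of the ring change discussed above can be verified on the host itself; a sketch assuming the eno1 interface and the tg3 counter names quoted earlier in the log:

    # "Pre-set maximums" vs "Current hardware settings" shows whether the new RX
    # ring size of 1000 actually took effect
    sudo ethtool -g eno1

    # The driver-level drop counters mentioned earlier in the log
    sudo ethtool -S eno1 | grep -E 'rx_discards|mbuf_lwm_thresh_hit'

    # ethtool -G resets the NIC, which is the link down / link up pair seen in dmesg
    sudo dmesg -T | grep -i eno1 | tail -n 5
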
[13:50:59] it doesn't seem related but the timing is matching really well
[13:51:03] taavi: The reboot cookbook failed on these hosts while I was at lunch
[13:51:07] so maybe something related to it triggered the issue
[13:51:17] unlikely but the timing is troubling
[13:51:31] nothing standing out in the interface stats itself
[13:52:36] elukey: retransmits are to a specific host? or a set of hosts?
[13:52:40] taavi: I'll put them in pooled=inactive so deployments don't fail
[13:52:56] upstream links don't show loss either
[13:53:04] ethtool shows rx_discards in the thousands for 2004/5/6
[13:53:10] XioNoX: the bulk of it should be to conf200[56] (the other 2 members of the cluster)
[13:55:13] akosiaris: the other 2 hosts are showing the same issues?
[13:55:37] no, not quite.
[13:56:27] The way we singled out conf2004 is that the other 2 hosts have an increased Quorum GET latency (which requires 2+ answering), way higher than this host
[13:56:44] short term I'd suggest trying to bounce the switch port and replace the SFP-T
[13:56:57] this is a good view
[13:56:58] https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=etcd&var-error_rate=0.001&var-slo_latency_threshold=0.032&viewPanel=23&from=now-90d&to=now
[13:56:58] nevermind, not an SFP-T
[13:58:34] akosiaris: 2004's latency seems stable now, maybe it is really just a matter of bouncing the port
[13:58:37] and/or a reboot
[13:59:02] it's the pybal confd host, vgutierrez won't be happy if we reboot it without letting him know
[13:59:09] oh hi Valentin :P
[13:59:11] :)
[13:59:16] maybe removal of the switch caused the flows internal to the VC to be routed differently, but so far I couldn't find any errors on any interfaces
[13:59:41] akosiaris: I'd be happy TBH, pybal not so much :)
[14:00:36] (meeting bbl)
[14:01:48] well, I can switch pybal back and forth. But elukey has a point regarding OS and firmware. They are both old... buster + whatever firmware was bundled with it for Broadcom NICs
[14:02:08] I think replacements are on the way this FY though
[14:02:48] actually the fact that the errors show up on the host's RX seems to indicate that they're not anywhere else in the fabric
[14:03:01] but more related to the physical link
[14:03:09] maybe the cable?
[14:03:25] could be too
[14:03:55] do we graph that rx_discards somewhere? and other ethtool metrics?
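
One way to answer the "retransmits to a specific host or a set of hosts?" question directly on conf2004 would be a per-peer breakdown from ss; a rough sketch only, since ss output formatting varies between versions:

    # Pair each socket's address line with its info line, keep only sockets that
    # have retransmitted, and count them per peer address:port
    sudo ss -tin state established | tail -n +2 | paste - - \
      | grep 'retrans:' | awk '{print $4}' | sort | uniq -c | sort -rn | head
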
[14:04:04] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:06:38] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:08:10] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:08:48] I don't think we have https://github.com/Showmax/prometheus-ethtool-exporter installed in the infra
[14:08:51] so, probably no
[14:10:41] however we can see those still under node_network_receive_drop_total prometheus metrics
[14:11:27] so they are viewable here: https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=conf2004&var-datasource=thanos&var-cluster=etcd&from=1689658661804&to=1689729376575&forceLogin&viewPanel=11&editPanel=11
[14:14:06] 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10MoritzMuehlenhoff) >>! In T345220#9143356, @ssastry wrote: > /srv/data has the db content from 1001. I would ideally like that copied over. I synched over the data from 1001 to 1002:/srv/data/my...
[14:15:24] to summarize, retransmits mean dropped packets somewhere between A and Z. Here the other hosts in B3 don't show an increase in retransmits, and the drop counters on all the row B switches (including the VC) are fine. On the other hand the drop counter on conf2004 RX is increasing, so some packets arriving on the host from the switch are corrupted. Based on that I'd say it's something wrong between the switch TX and the host's RX, so either
[14:15:24] NIC or cable.
[14:19:47] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) kubernetes1047 - E 3. U 39. port 36 Cableid 502576 kubernetes1048 - E 3. U 40. port 37 Cableid 502577 kubernetes1049 - E 3. U 41. port 45 Cableid 502578 kubernet...
[14:58:53] Counts number of packets dropped by the device due to lack of buffer space. This usually indicates that the host interface is slower than the network interface, or host is not keeping up with the receive packet rate.
[14:59:16] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:59:16] I see rx_missed_errors in ip -s -s link show dev eno1
[14:59:34] and they are exactly the same number as rx_discards and mbuf_lwm_thresh_hit
[15:01:07] which btw as a number is not increasing?
[15:03:18] returning to the reboot question - IIRC for 2004 we just need to file a change in puppet, restart the affected pybals, and then the reverse, right?
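
A sketch of cross-checking the three counters discussed above on the host itself; which bucket missed frames land in (rx_dropped vs rx_missed_errors) is driver-dependent, so treat the exact file names as assumptions:

    # Detailed kernel view, including rx_missed_errors (the second -s adds the
    # error breakdown)
    ip -s -s link show dev eno1

    # The raw sysfs counters behind node_exporter's network metrics
    cat /sys/class/net/eno1/statistics/rx_dropped /sys/class/net/eno1/statistics/rx_missed_errors

    # Driver counters from tg3, for comparison with the two above
    sudo ethtool -S eno1 | grep -E 'rx_discards|mbuf_lwm_thresh_hit'
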
[15:03:21] catching up from much earlier: tcp_tw_reuse is decent, but related tcPtw_recycle can be dangerous
[15:03:24] we could schedule one
[15:03:28] s/P/_/
[15:03:35] ah yes yes
[15:04:57] one thing could be to test it via sysctl, it doesn't require a reboot or anything IIRC
[15:05:33] (easy to revert as well)
[15:35:56] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) So, `ethtool -G eno1 rx 1000` apparently did the [trick](https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=con...
[15:47:55] 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10ssastry) >>! In T345220#9146394, @MoritzMuehlenhoff wrote: >>>! In T345220#9143356, @ssastry wrote: >> /srv/data has the db content from 1001. I would ideally like that copied over. > > I synche...
[16:24:18] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10cmooney) We removed switch asw-b1-codfw as it no longer had any servers connected (they were moved to cloudsw1-b1-codfw). The correlation between th...
[21:26:25] 10serviceops, 10Data-Persistence, 10Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10Urbanecm_WMF) Thank you @akosiaris et al!
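
On the tcp_tw_reuse test floated at 15:04: it can indeed be toggled at runtime and rolled back without a reboot; a sketch (tcp_tw_recycle, the dangerous one, was removed upstream in Linux 4.12, so it is not an option on these kernels anyway):

    # Record the current value first so it can be restored exactly
    sysctl net.ipv4.tcp_tw_reuse

    # Allow reuse of TIME_WAIT sockets for new outgoing connections (i.e. the
    # nginx -> local etcd direction); takes effect immediately, no reboot needed
    sudo sysctl -w net.ipv4.tcp_tw_reuse=1

    # Roll back by writing the recorded value back, or persist via sysctl.d /
    # puppet once it has proven useful
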