[06:08:34] 10serviceops, 10Page Content Service, 10RESTBase-API: Mobile HTML endpoint returns an empty response - https://phabricator.wikimedia.org/T345794 (10Brycehughes)
[06:34:18] 10serviceops, 10Similarusers: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 (10Niharika)
[07:04:19] 10serviceops, 10Page Content Service, 10RESTBase-API: Mobile HTML endpoint returns an empty response - https://phabricator.wikimedia.org/T345794 (10Brycehughes)
[07:46:31] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10ayounsi) Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details you provided (and fix suggestions...
[08:13:14] mc2040 is down, known issue? couldn't find an existing Phab task
[08:15:01] moritzm: Hmm no, it was rebooted 2 days ago
[08:16:32] looks hard down, nothing on console
[08:17:54] DIMM issue
[08:18:08] I'll open a DCops task
[08:21:44] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10elukey) @cmooney @ayounsi thanks a lot! On the host side, I'd try two things (not sure if they could help or not): 1) Do a simple reboot. The hosts...
[08:21:44] hello folks!
[08:21:58] conf2004 got back to ~8/9k sockets in TIME_WAIT
[08:22:33] I'd really like to try net.ipv4.tcp_tw_reuse to see if it helps with that, and/or to reboot the node with a fresh kernel
[08:22:41] elukey: I am testing something btw. If I am right, it's kinda mindblowing
[08:22:51] and a very unfortunate coincidence
[08:23:26] and...
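The ~8/9k TIME_WAIT sockets elukey mentions is the kind of figure you would confirm with `ss`. A minimal sketch, run here against canned sample output (hypothetical addresses) so it is self-contained; on a real host you would pipe live `ss -tan` output into the same pipeline:

```shell
# Tally sockets per TCP state, as you would to verify a TIME_WAIT pile-up.
# The sample text below stands in for real `ss -tan` output; on a host:
#   ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
sample='State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0 0 10.0.0.1:2379 10.0.0.2:41000
TIME-WAIT 0 0 10.0.0.1:2379 10.0.0.2:41002
TIME-WAIT 0 0 10.0.0.1:2379 10.0.0.2:41004'
printf '%s\n' "$sample" | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
```

With the sample data this prints `2 TIME-WAIT` above `1 ESTAB`; the first column is the socket count per state.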
what on earth, I may be right
[08:23:27] https://grafana-rw.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=etcd&var-error_rate=0.001&var-slo_latency_threshold=0.032&viewPanel=30
[08:23:31] akosiaris: this is a click bait statement, you cannot leave me lingering without knowing
[08:23:34] :D
[08:23:43] I know, I do it on purpose ;-)
[08:24:00] I have been drafting a pretty big response in the task, documenting findings
[08:24:16] and frankly, I was getting absolutely nowhere
[08:24:21] until 5 mins ago
[08:25:06] and also https://grafana.wikimedia.org/d/slo-Etcd/etcd-slo-s?orgId=1&from=now-15m&to=now&viewPanel=4
[08:25:35] claime: ack, thanks
[08:25:46] ok, I need to take a breath, but the hint is elukey:
[08:25:50] akosiaris@conf2004:~$ ls -l /var/lib/dpkg/info/cadvisor.list
[08:25:50] -rw-r--r-- 1 root root 470 Jul 18 09:24 /var/lib/dpkg/info/cadvisor.list
[08:25:58] akosiaris is the buzzfeed of SRE
[08:26:03] right ON THE MINUTE
[08:26:09] on the FREAKING MINUTE
[08:26:12] wtf
[08:26:26] and yes, I just disabled cadvisor on those hosts
[08:26:30] and that just fixed it...
[08:26:39] XioNoX: ^
[08:27:08] what the...
[08:27:23] wait why are we using cadvisor on hosts that don't run containers?
[08:27:30] Did I miss something?
[08:27:36] o11y was rolling it out everywhere
[08:27:41] I think it is fleet wide
[08:27:43] there was a task back in may
[08:27:49] yeah, it's fleet-wide
[08:28:00] https://phabricator.wikimedia.org/T108027
[08:28:22] Ah because it does cgroup-level metrics
[08:28:25] a'ight
[08:29:03] frankly, this is probably old kernel vs newer kernel
[08:29:11] because conf1* have the exact same cadvisor version
[08:29:26] some code path in the buster kernel is slower?
[08:29:37] wrapping up my comment in the task
[08:30:06] nice find! :o
[08:30:07] good find!!
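The `ls -l` hint works because dpkg rewrites `/var/lib/dpkg/info/<pkg>.list` when a package is (re)installed, so the file's mtime marks the install minute and can be lined up against an incident's start. A sketch of the same check, simulated with a scratch file so it runs anywhere (the temp path is hypothetical; on a real host point it at the real `.list` file, and `/var/log/dpkg.log` is the authoritative record):

```shell
# Simulate the cadvisor.list timestamp check with a scratch file.
# dpkg rewrites the real /var/lib/dpkg/info/<pkg>.list on (re)install,
# so its mtime approximates when the package landed on the host.
info_dir=$(mktemp -d)
touch -d '2023-07-18 09:24:00' "${info_dir}/cadvisor.list"
date -r "${info_dir}/cadvisor.list" '+%F %T'   # -> 2023-07-18 09:24:00
```

Matching that timestamp against the first latency spike on the Grafana dashboard is exactly the "right ON THE MINUTE" correlation made here.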
[08:30:13] my chances to get a +1 for tw_reuse got down to zero now :D
[08:30:18] lol
[08:30:45] x)
[08:31:18] elukey: just use tc secretly :)
[08:40:02] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) Hi # TL;DR cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bum...
[08:42:27] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel ver...
[08:44:12] XioNoX: the coincidence with your switch removal is crazy though
[08:45:27] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) There's a few more actionables here: 1. Re-evaluate our SLO target for conf hosts etcd service. Despite having exhausted the error budget...
[08:46:32] akosiaris: nice explanation thanks a lot!
[08:53:09] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10fgiunchedi) >>! In T345738#9148513, @akosiaris wrote: > Hi > > # TL;DR > > cadvisor is to blame. Adding @fgiunchedi for his information and a thumb...
[08:59:28] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10JMeybohm) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> cad...
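For reference, the tw_reuse change elukey was fishing a +1 for would amount to a one-line sysctl. A sketch of the proposed tuning only (the applied fix was disabling cadvisor, not this):

```
# Hypothetical /etc/sysctl.d/ fragment, not an applied change.
# tcp_tw_reuse lets new *outgoing* connections reuse sockets lingering in
# TIME_WAIT when the kernel judges it protocol-safe; it does not affect
# inbound sockets, which is why it was only ever a partial mitigation here.
net.ipv4.tcp_tw_reuse = 1
```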
[08:59:50] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> ca...
[09:39:49] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Aklapper)
[09:49:44] folks I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955584 to basically rename the fastapi-app chart to something more generic
[09:49:49] (python-webapp)
[09:49:56] lemme know if it makes sense :)
[09:50:47] 10serviceops, 10Machine-Learning-Team, 10MinT, 10SRE, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) 05Open→03Resolved a:03Pginer-WMF Since MinT [was launched](https://diff.wikimedia.org/2023/06/13/mint-support...
[10:20:33] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10hashar) p:05Triage→03Unbreak!
[10:45:42] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10Clement_Goubert) p:05Unbreak!→03Medium Since it only impacts one pod, it has a...
[11:01:44] 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert A subsequent deployme...
[11:18:44] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) p:05Triage→03Medium Given this isn't urgent and we have multiple ways of dealing with this, I've re-enabled pup...
[11:46:17] 10serviceops, 10Kubernetes: Wikikube staging clusters are out of IPv4 Pod IP's - https://phabricator.wikimedia.org/T345823 (10JMeybohm) p:05Triage→03High
[11:59:16] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10jijiki) Dear @Jhancock.wm or @Papaul, the server can be shut down and checked at your convenience, this part of the stack has failovers in place. Thank you!
[12:19:21] 10serviceops, 10Parsoid (Tracking): Migrate testreduce database from testreduce1001 to testreduce1002 - https://phabricator.wikimedia.org/T345831 (10MoritzMuehlenhoff)
[12:20:39] 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10MoritzMuehlenhoff) >>! In T345220#9146662, @ssastry wrote: >>>! In T345220#9146394, @MoritzMuehlenhoff wrote: >>>>! In T345220#9143356, @ssastry wrote: >>> /srv/data has the db content from 1001....
[13:30:08] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) I reseated DIMM_A2. I also found this ticket from January T326834. In that one B2 was having errors and I moved it to A2. Now that A2 is having iss...
[13:30:24] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) a:05Papaul→03Jhancock.wm
[13:34:06] 10serviceops, 10Kubernetes: Audit charts drift between staging and production - https://phabricator.wikimedia.org/T345839 (10fgiunchedi)
[13:49:51] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) SR: 175477369
[13:57:50] 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10ssastry) ` npm WARN old lockfile FetchError: request to https://registry.npmjs.org/@babel%2fhelper-validator-identifier failed, reason: connect ETIMEDOUT 2606:4700::6810:1e22:443 npm WARN old loc...
[14:09:59] going to deploy api-gateway for a liftwing change
[14:29:25] 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur)
[15:17:18] 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) @Fabfur one of the disks is being odd. Is it safe for me to shut down the server and reseat internal components right now?
[15:18:57] 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur) Hi @Jhancock.wm, I think the best way is to ask someone from service operations first
[15:23:18] 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10akosiaris) Yes, it is safe, we haven't put those in production yet.
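A fail event like the one on /dev/md/0 of kubernetes2028 is the sort of thing you would triage from /proc/mdstat, where a failed member is flagged `(F)` and the status bitmap shows a hole (e.g. `[U_]` instead of `[UU]`). A self-contained sketch over sample mdstat text (device names hypothetical; on the host you would read the real `/proc/mdstat` or run `mdadm --detail /dev/md0`):

```shell
# Flag degraded md arrays: failed members are marked (F), and the member
# bitmap shows an underscore per missing disk. Sample text stands in for
# /proc/mdstat so this runs anywhere.
mdstat='md0 : active raid1 sdb1[1](F) sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]'
printf '%s\n' "$mdstat" | grep -E '\(F\)|\[U*_+U*\]'
```

Both lines of the sample match here, confirming a degraded array; on a healthy array the grep prints nothing.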
[16:14:42] 10serviceops, 10Shellbox: Rename the shellbox service to shellbox-score - https://phabricator.wikimedia.org/T345868 (10RLazarus) p:05Triage→03Low
[16:28:01] 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) a:03Jhancock.wm
[18:53:50] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernetes1037 - B 8. U 17. port 14 CableID 1796 kubernetes1038 - B 8. U 21. port 19 CableID 1801 kubernetes1039 - B 8. U 22. port 15 CableID 1797 kubernetes1040 - B...
[20:52:46] 10serviceops, 10ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite)
[20:56:52] 10serviceops, 10ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite)