[06:08:34] <wikibugs>	 10serviceops, 10Page Content Service, 10RESTBase-API: Mobile HTML endpoint returns an empty response - https://phabricator.wikimedia.org/T345794 (10Brycehughes)
[06:34:18] <wikibugs>	 10serviceops, 10Similarusers: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274 (10Niharika)
[07:04:19] <wikibugs>	 10serviceops, 10Page Content Service, 10RESTBase-API: Mobile HTML endpoint returns an empty response - https://phabricator.wikimedia.org/T345794 (10Brycehughes)
[07:46:31] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10ayounsi) Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details your provided (and fix suggestions...
[08:13:14] <moritzm>	 mc2040 is down, known issue? couldn't find an existing Phab task
[08:15:01] <claime>	 moritzm: Hmm no, it was rebooted 2 days ago
[08:16:32] <claime>	 looks hard down, nothing on console
[08:17:54] <claime>	 DIMM issue
[08:18:08] <claime>	 I'll open a DCops task
[08:21:44] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10elukey) @cmooney @ayounsi thanks a lot! On the host side, I'd try two things (not sure if they could help or not):  1) Do a simple reboot. The hosts...
[08:21:44] <elukey>	 hello folks!
[08:21:58] <elukey>	 conf2004 got back to ~8/9k sockets in TIME_WAIT
[08:22:33] <elukey>	 I'd really like to try net.ipv4.tcp_tw_reuse to see if it helps with that, and/or to reboot the node with a fresh kernel
[08:22:41] <akosiaris>	 elukey: I am testing something btw. If I am right, it's kinda mindblowing
[08:22:51] <akosiaris>	 and a very unfortunate coincidence 
[08:23:26] <akosiaris>	 and... what on earth, I may be right
[08:23:27] <akosiaris>	 https://grafana-rw.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=etcd&var-error_rate=0.001&var-slo_latency_threshold=0.032&viewPanel=30
[08:23:31] <elukey>	 akosiaris: this is a click bait statement, you cannot leave me lingering without knowing
[08:23:34] <elukey>	 :D
[08:23:43] <akosiaris>	 I know, I do it on purpose ;-)
[08:24:00] <akosiaris>	 I have been drafting a pretty big response in the task, documenting findings
[08:24:16] <akosiaris>	 and frankly, I was getting absolutely nowhere
[08:24:21] <akosiaris>	 until 5 mins ago
[08:25:06] <akosiaris>	 and also https://grafana.wikimedia.org/d/slo-Etcd/etcd-slo-s?orgId=1&from=now-15m&to=now&viewPanel=4
[08:25:35] <moritzm>	 claime: ack, thanks
[08:25:46] <akosiaris>	 ok, I need to take a breath, but the hint is elukey:
[08:25:50] <akosiaris>	 akosiaris@conf2004:~$ ls -l /var/lib/dpkg/info/cadvisor.list 
[08:25:50] <akosiaris>	 -rw-r--r-- 1 root root 470 Jul 18 09:24 /var/lib/dpkg/info/cadvisor.list
[08:25:58] <claime>	 akosiaris is the buzzfeed of SRE
[08:26:03] <akosiaris>	 right ON THE MINUTE
[08:26:09] <akosiaris>	 on the FREAKING MINUTE
[08:26:12] <claime>	 wtf
[08:26:26] <akosiaris>	 and yes, I just disabled cadvisor on those hosts
[08:26:30] <akosiaris>	 and that just fixed it...
[08:26:39] <akosiaris>	 XioNoX: ^
[08:27:08] <elukey>	 what the...
[08:27:23] <claime>	 wait why are we using cadvisor on hosts that don't run containers?
[08:27:30] <claime>	 Did I miss something?
[08:27:36] <akosiaris>	 o11y was rolling it out everywhere
[08:27:41] <elukey>	 I think it is fleet wide
[08:27:43] <akosiaris>	 there was a task back in may
[08:27:49] <moritzm>	 yeah, it's fleet-wide
[08:28:00] <akosiaris>	 https://phabricator.wikimedia.org/T108027
[08:28:22] <claime>	 Ah because it does cgroup-level metrics
[08:28:25] <claime>	 a'ight
[08:29:03] <akosiaris>	 frankly, this is probably old kernel vs newer kernel
[08:29:11] <akosiaris>	 because conf1* have the exact same cadvisor version
[08:29:26] <akosiaris>	 some code path in the buster kernel is slower?
[08:29:37] <akosiaris>	 wrapping up my comment in the task
[08:30:06] <jayme>	 nice find! :o
[08:30:07] <XioNoX>	 good find!!
[08:30:13] <elukey>	 my chances to get a +1 for tw_reuse got down to zero now :D
[08:30:18] <jayme>	 lol
[08:30:45] <claime>	 x)
[08:31:18] <XioNoX>	 elukey: just use tc secretly :)
[08:40:02] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) Hi  # TL;DR  cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bum...
[08:42:27] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel ver...
[08:44:12] <elukey>	 XioNoX: the coincidence with your switch removal is crazy though
[08:45:27] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) There's a few more actionables here:  1. Re-evaluate our SLO target for conf hosts etcd service. Despite having exhausted the error budget...
[08:46:32] <elukey>	 akosiaris: nice explanation thanks a lot!
[08:53:09] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10fgiunchedi) >>! In T345738#9148513, @akosiaris wrote: > Hi >  > # TL;DR >  > cadvisor is to blame. Adding @fgiunchedi for his information and a thumb...
[08:59:28] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10JMeybohm) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >>  >> # TL;DR >>  >> cad...
[08:59:50] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >>  >> # TL;DR >>  >> ca...
[09:39:49] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Aklapper)
[09:49:44] <elukey>	 folks I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955584 to basically renamed the fastapi-app's chart to something more generic
[09:49:49] <elukey>	 (python-webapp)
[09:49:56] <elukey>	 lemme know if it makes sense :)
[09:50:47] <wikibugs>	 10serviceops, 10Machine-Learning-Team, 10MinT, 10SRE, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) 05Open→03Resolved a:03Pginer-WMF Since MinT [was launched](https://diff.wikimedia.org/2023/06/13/mint-support...
[10:20:33] <wikibugs>	 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10hashar) p:05Triage→03Unbreak!
[10:45:42] <wikibugs>	 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10Clement_Goubert) p:05Unbreak!→03Medium Since it only impacts one pod, it has a...
[11:01:44] <wikibugs>	 10serviceops, 10Wikimedia-production-error: Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 - https://phabricator.wikimedia.org/T345812 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert A subsequent deployme...
[11:18:44] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (10akosiaris) p:05Triage→03Medium Given this isn't urgent and we have multiple ways of dealing with this, I 've re-enabled pup...
[11:46:17] <wikibugs>	 10serviceops, 10Kubernetes: Wikikube staging clusters are out of IPv4 Pod IP's - https://phabricator.wikimedia.org/T345823 (10JMeybohm) p:05Triage→03High
[11:59:16] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10jijiki) Dear @Jhancock.wm or @Papaul, server can be shutdown and checked at your convenience, this part of the stack has failovers in place. Thank you!
[12:19:21] <wikibugs>	 10serviceops, 10Parsoid (Tracking): Migrate testreduce database from testreduce1001 to testreduce1002 - https://phabricator.wikimedia.org/T345831 (10MoritzMuehlenhoff)
[12:20:39] <wikibugs>	 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10MoritzMuehlenhoff) >>! In T345220#9146662, @ssastry wrote: >>>! In T345220#9146394, @MoritzMuehlenhoff wrote: >>>>! In T345220#9143356, @ssastry wrote: >>> /srv/data has the db content from 1001....
[13:30:08] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) I reseated DIMM_A2. I also found this ticket from January T326834. In that one B2 was having errors and I moved it to A2. Now that A2 is having iss...
[13:30:24] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) a:05Papaul→03Jhancock.wm
[13:34:06] <wikibugs>	 10serviceops, 10Kubernetes: Audit charts drift between staging and production - https://phabricator.wikimedia.org/T345839 (10fgiunchedi)
[13:49:51] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) SR: 175477369
[13:57:50] <wikibugs>	 10serviceops, 10Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (10ssastry) ` npm WARN old lockfile FetchError: request to https://registry.npmjs.org/@babel%2fhelper-validator-identifier failed, reason: connect ETIMEDOUT 2606:4700::6810:1e22:443 npm WARN old loc...
[14:09:59] <elukey>	 going to deploy api-gateway for a liftwing change
[14:29:25] <wikibugs>	 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur)
[15:17:18] <wikibugs>	 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) @Fabfur one of the disks is being odd. Is it safe for me to shut down the server at reseat internal components right now?
[15:18:57] <wikibugs>	 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Fabfur) Hi @Jhancock.wm, think the best way is to ask someone from the service operations first
[15:23:18] <wikibugs>	 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10akosiaris) Yes, it is safe, we haven't put yet those in production.
[16:14:42] <wikibugs>	 10serviceops, 10Shellbox: Rename the shellbox service to shellbox-score - https://phabricator.wikimedia.org/T345868 (10RLazarus) p:05Triage→03Low
[16:28:01] <wikibugs>	 10serviceops, 10SRE, 10ops-codfw: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) a:03Jhancock.wm
[18:53:50] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernets1037 - B 8. U 17. port 14 CableID 1796 kubernets1038 - B 8. U 21. port 19 CableID 1801 kubernets1039 - B 8. U 22. port 15 CableID 1797 kubernets1040 - B...
[20:52:46] <wikibugs>	 10serviceops, 10ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite)
[20:56:52] <wikibugs>	 10serviceops, 10ops-codfw: mw2444 down - https://phabricator.wikimedia.org/T345884 (10colewhite)