[05:04:28] serviceops, sre-alert-triage: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (Joe) Open→Resolved a: Joe Yes, we forgot to run `systemctl reset-failed` on mwmaint2002.
[06:41:18] hello folks
[06:41:28] I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/942061 for kafka main, to raise threads a little more
[06:41:34] lemme know your thoughts
[06:42:29] elukey: +1ed, but I would like to know why 60% and not e.g. 55% or 70%
[06:42:49] aka how did the magic number come up? a roll of a d20 is ok as an answer btw :P
[06:43:11] although I'd love to see you come up with 60% from a d20
[06:43:51] it doesn't matter much btw, which is why I +1ed
[06:44:37] akosiaris: o/ nono it is good to brainbounce, it is a value more inline with other clusters, even 70% would be good, I picked 60% in my head as minimum healthy threshold :)
[06:45:23] ok, then I'd ask to add that reasoning (inline with the other clusters) to the commit message
[06:45:31] sure sure
[06:45:36] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-24h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=logging-eqiad&var-cluster=logstash&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[06:45:53] this is what we have now for logging-eqiad, that was around 30/40%
[06:46:02] but it is less busy than main-eqiad
[06:46:42] updated :)
[06:47:27] awesome, thanks!
[06:48:06] rolling out :)
[08:00:42] akosiaris: 60% is 12 on a d20
[08:00:46] :p
[08:01:11] elukey: there, see? that's how it's done :P
[08:44:34] serviceops, MW-on-K8s: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (Joe) sadly more issues were found under `conf/`: * activeMWVersions.php **shells out to scap** which in turn just json decodes a file on disk. * index.php tries to read `/etc/conftool-state/mediawiki...
[09:14:21] speaking of numbers :D
[09:14:21] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[09:15:13] with a rebalance it will probably spread more evenly, but it looks ok now
[09:15:30] I am going to add an alarm for this metric, warning at 30% and critical at 20%
[09:15:34] so we know when it happens
[09:25:47] wow, nice
[10:27:59] serviceops, Prod-Kubernetes, Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (Clement_Goubert) From wayback machine, the magic google formula is: > Allocatable resources are calculated in the following way: > > ALLOCATABLE = CAPACITY...
[10:39:00] _joe_: I summarized our conversation on https://www.mediawiki.org/wiki/User_talk:DKinzler_(WMF)/API_URL_Guidelines#Conversation_with_Giuseppe
[10:39:09] please check that I didn't get anything wrong
[10:39:19] <_joe_> duesen: you rat me out like that? :D
[10:39:23] <_joe_> and yes, will do
[10:39:32] I can anonymize if you like
[10:46:59] <_joe_> nah I'm joking
[10:47:11] <_joe_> also my name is the most common name in italy still :D
[10:47:23] <_joe_> (I think, might not be true anymore)
[11:38:53] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (hnowlan) Looking at the spec, it appears there are a few more endpoints not covered: https://github.com/wik...
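For context on the alarm mentioned at 09:15:30 above: the plan is to warn when the Kafka metric in that panel (a headroom-style percentage where lower is worse, healthy being around 60%) drops below 30% and go critical below 20%. The real alert would live in Alertmanager/Icinga, not Python; the snippet below is only a minimal sketch of that threshold logic, with the metric identity and the function name assumed for illustration.

```python
# Illustrative only: classify a headroom-style percentage (lower is worse)
# against the warning/critical thresholds mentioned above.

def classify_headroom_percent(value: float, warning: float = 30.0, critical: float = 20.0) -> str:
    """Return 'ok', 'warning' or 'critical' for a 0-100 percentage."""
    if value < critical:
        return "critical"
    if value < warning:
        return "warning"
    return "ok"

if __name__ == "__main__":
    for sample in (60.0, 35.0, 25.0, 12.0):
        print(f"{sample}% -> {classify_headroom_percent(sample)}")
```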
[11:43:09] and kafka main-codfw completed as well
[11:57:21] serviceops, Prod-Kubernetes, Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (Clement_Goubert) Summary from a realtime discussion with @JMeybohm - Using `--system-reserved` is kind of dangerous, because it uses cgroups and may lead to O...
[12:00:22] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) These were never exposed in public traffic so I don't think it needs to be exposed now (also un...
[12:12:04] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert) We had a long but productive discussion with @JMeybohm this morning, resulting in a tentative plan of action: # Graph the global latency of wikikube hosted se...
[12:24:18] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) Regarding aggregated are you referring to the `query aggregated = true` for example here? https...
[12:57:41] serviceops, Data-Platform-SRE, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (elukey) The stability of the kafka main cluster is now way better, they are not totally rebalanced but this ca...
[13:11:43] hnowlan, kamila_ - time to deploy changeprop in codfw? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/941780
[13:12:26] also, to all serviceops --^ We are ready to update changeprop, namely move from stretch -> buster and update the node-rdkafka client
[13:12:44] elukey: sgtm!
[13:14:53] hnowlan: ack, let's do it then!
[13:15:09] do you want me to deploy or will you?
[13:15:51] I can do it
[13:16:30] I'll start with changeprop codfw
[13:16:48] super
[13:17:03] we can leave it running for some days in there to be sure, before hitting eqiad
[13:17:21] cool
[13:18:00] https://media.tenor.com/aIZrrqGuOFwAAAAd/hold-on-to-your-butts-jurrasic-park.gif
[13:20:58] rule processing coming back up
[13:22:01] backlogs back down
[13:23:58] lol
[13:25:40] the container cpu usage grew a lot, maybe it is only temporary after the deployment
[13:28:01] yeah there's frequently a spike just after
[13:29:28] so far everything seems good
[13:33:22] it doesn't climb down, wondering what that is
[13:37:12] hmm yeah
[13:37:31] network throughput is also a bit weirdly low comparatively
[13:43:08] my first thought was that the pods were reprocessing data
[13:43:19] but we would have a huge backlog
[13:43:37] we updated os + librdkafka, probably some setting is to fine tune
[13:46:07] yeah. I would have liked to see a big pump in throughput alongside CPU usage bump but alas. I'd be tempted to see what a roll-restart would cause just in case there's been some imbalance (8 pods used out of 12, many handling 2 topics)
[13:47:01] where do you see the 8/12 pods used?
[13:47:03] even if that is a factor it's not the only one I'd guess, this is a somewhat unprecedented bump
[13:47:24] elukey: https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&from=now-30m&to=now&var-dc=codfw%20prometheus%2Fk8s&viewPanel=80
[13:48:07] ah TIL!
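On the node-allocatable thread (T277876, quoted at 10:27 and 11:57 above): the Google formula is cut off in the bot message, but the shape documented for Kubernetes is that allocatable capacity is node capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold. Below is a rough sketch of that arithmetic only; the example numbers are made up and are not the real wikikube reservations.

```python
# Rough sketch of the Kubernetes node-allocatable arithmetic discussed in
# T277876. The reservation values below are illustrative, not real settings.

def allocatable(capacity: float, kube_reserved: float = 0.0,
                system_reserved: float = 0.0, eviction_threshold: float = 0.0) -> float:
    """ALLOCATABLE = CAPACITY - kube-reserved - system-reserved - eviction threshold."""
    return capacity - kube_reserved - system_reserved - eviction_threshold

if __name__ == "__main__":
    # e.g. memory in GiB on a hypothetical node
    print(allocatable(capacity=128.0, kube_reserved=2.0,
                      system_reserved=4.0, eviction_threshold=0.5))
```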
[13:49:55] I see normal processing metrics again at https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=5m&from=now-30m&to=now&viewPanel=9
[13:50:05] why did that take... 16 minutes to happen ?
[13:51:57] that... is bizarre - because that gap starts about 14 minutes *after* the deploy. beforehand it was processing things fine
[13:53:01] the containers don't seem throttled so far
[13:54:54] yeah I was checking at limits and it's 3 CPU
[13:55:06] so we are definitely not starving those pods
[14:02:28] something interesting - https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=5m&from=now-30d&to=now&var-dc=eqiad%20prometheus%2Fk8s-staging&viewPanel=56
[14:02:42] this is staging, and on the 13th we deployed
[14:03:14] the usage is mostly the same, and staging doesn't process anything basically
[14:03:44] so it must be something that does busy waiting or similar
[14:04:41] mmm https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=5m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&viewPanel=34
[14:04:44] this is very weird
[14:05:00] and TTL is going up a lot as well
[14:06:36] ah no wait my bad, it started hours ago
[14:06:38] scratch that
[14:06:48] I got fooled by the 30d scale
[14:13:49] hnowlan: we can rollback and investigate in staging
[14:16:26] yeah, sounds good
[14:17:00] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/942005
[14:17:49] looks good, very sad though :(
[14:17:55] I bet it is a librdkafka setting
[14:18:04] yeah
[14:18:11] I wonder if it's as simple as poll frequency or something
[14:18:26] there was a point where we were *trying* to increase CPU usage by changeprop in the past heh
[14:18:59] tbh that metrics gap and general disappearing metrics on changeprop are such a killer
[14:19:10] we can't differentiate between a crisis and a long-running worker, it's vexing
[14:19:11] definitely
[14:20:28] I'd propose that next time we bump to buster and deploy, then do the librdkafka bump
[14:20:37] just to give us a diff
[14:26:32] could be an option yes, I'll try to debug in staging the busy wait
[14:26:43] never done it on a container so I'll learn something :D
[14:34:00] :D
[14:35:30] gl;hf
[14:35:32] :p
[14:46:41] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 3 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE)
[15:13:01] hnowlan: one qs - did we also change the nodejs version with the upgrade?
[15:13:50] elukey: no
[15:15:48] hnowlan: weird, perf shows that most of the cpu usage is spent in libnode.so
[15:19:10] mmm no maybe it was the wrong pod, lemme see
[15:35:07] sorry, I'm wrong!
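One way to picture the "poll frequency" suspicion from 14:18 above: a consumer loop that polls librdkafka with a zero (or near-zero) timeout keeps spinning even when nothing arrives, which looks exactly like high CPU with flat throughput. changeprop is built on node-rdkafka, not Python, so the snippet below is only a hedged illustration of that failure mode using the confluent-kafka Python client; the broker address, group id and topic are placeholders.

```python
# Illustration of how a tight poll loop over librdkafka can busy-wait:
# poll(0) returns immediately when there is nothing to consume, so the loop
# spins and burns CPU even though no messages are processed.
# Placeholders: broker address, group id and topic are not real values.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka-main-placeholder:9092",
    "group.id": "cpu-busy-wait-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])

try:
    while True:
        msg = consumer.poll(0)        # non-blocking: busy-waits when the topic is idle
        # msg = consumer.poll(1.0)    # blocking poll: waits inside librdkafka instead
        if msg is None:
            continue
        if msg.error():
            continue
        # ... handle msg.value() ...
finally:
    consumer.close()
```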
[15:35:17] ok so I get libuv and libnode as at fault
[15:35:22] and I see https://github.com/Blizzard/node-rdkafka/releases/tag/v2.14.5
[15:35:25] buster: 10.24.0~dfsg-1~deb10u3
[15:35:27] Update node engine requirement to >= 14
[15:35:28] stretch: 10.15.2~dfsg-1+wmf1
[15:36:14] well it is always node 10 so we should be good :)
[15:36:52] most of the time is spent in
[15:36:52] uv__server_io
[15:36:52] v8::Value::IsFunction
[15:36:57] but not sure if it is normal or not
[15:37:50] we could also try what Andrew suggested in https://phabricator.wikimedia.org/T341140#8993746
[15:43:49] now I have
[15:43:49] uv__server_io
[15:43:49] node::ConnectionWrap::OnConnection
[15:43:52] node::AsyncWrap::MakeCallback
[15:43:55] node::InternalMakeCallback
[15:48:09] hmm
[15:49:41] I'd love to see what would have been in the debug logs during that. Seeing ConnectionWrap would make me wonder if it was repeatedly trying and failing to make connections or something
[15:50:51] pasted the stacktraces in the task
[15:51:00] serviceops, ChangeProp, WMF-JobQueue: Check if node-rdkafka's version on changeprop can be upgraded from 2.8.1 - https://phabricator.wikimedia.org/T341140 (elukey) We noticed a sharp and sustained increase in cpu usage after the deployment to prod codfw, and we rolled it back. The issue got unnotice...
[15:52:31] hnowlan: nothing really useful in the logs afaics
[15:53:11] or maybe it is the librdkafka callback, if we use any?
[15:56:53] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (hnowlan) As regards aggregated stuff, I mean the difference between https://wikifeeds.discovery.wmnet:4101/...
[15:57:34] oh ofc sorry, you're testing this on staging, oops.
[15:59:52] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 3 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (Jgiannelos) The `/aggregated` path is to be exposed for the restbase replacement, the other shouldn't be ex...
[16:45:51] serviceops, Abstract Wikipedia team, SRE, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF)
[17:25:28] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[18:00:47] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye completed: - rdb1013 (**PASS**) - Removed from P...
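A small aside on the version lines quoted at 15:35 above: node-rdkafka 2.14.5 declares a Node engine requirement of ">= 14", while both the stretch and buster packages mentioned ship Node 10.x. Purely as an illustration of that comparison (this is not how npm enforces `engines`; the parsing helper is a made-up sketch, and the version strings are the ones quoted in the log):

```python
# Sketch of the engine comparison implied above: both distro packages ship
# Node 10.x, while node-rdkafka 2.14.5 declares ">= 14" in its release notes.

def node_major(debian_version: str) -> int:
    """Extract the Node.js major version from a Debian package version string."""
    return int(debian_version.split(".")[0])

REQUIRED_MAJOR = 14  # engine floor from the v2.14.5 release notes

for dist, pkg_version in {
    "stretch": "10.15.2~dfsg-1+wmf1",
    "buster": "10.24.0~dfsg-1~deb10u3",
}.items():
    major = node_major(pkg_version)
    print(f"{dist}: node {major} -> meets '>= {REQUIRED_MAJOR}': {major >= REQUIRED_MAJOR}")
```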
[18:01:24] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jhancock.wm)
[18:14:02] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye
[19:35:16] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye completed: - rdb1014 (**PASS**) - Removed from P...
[19:36:34] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jhancock.wm)
[20:00:27] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jhancock.wm) Stalled→Resolved @akosiaris install is complete
[20:18:11] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)