[11:17:05] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10386062 (10Fabfur) 05Open→03In progress
[11:17:54] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10386064 (10Fabfur) Latest package with this feature has been released, moving this task to "In progress" and waiting some time for confirmation
[13:24:20] 06Traffic, 10Data-Engineering (Q2 2024 October 1st - December 31th): Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578#10386462 (10Fabfur) 05Open→03In progress
[15:21:16] 10Wikimedia-Apache-configuration, 06Security-Team, 07Security: https://www.mediawiki.org/.well-known/change-password redirects to HTTP - https://phabricator.wikimedia.org/T381625#10386732 (10sbassett) This is pretty low-risk as we force TLS for Wikimedia wikis. I'm going to make this public since there is n...
[15:21:17] 10Wikimedia-Apache-configuration, 06Security-Team, 07Security: https://www.mediawiki.org/.well-known/change-password redirects to HTTP - https://phabricator.wikimedia.org/T381625#10386733 (10sbassett)
[15:21:29] 06Traffic, 06DC-Ops, 10ops-esams, 10ops-magru, 06SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10386737 (10RobH) >>! In T373993#10385350, @BCornwall wrote: > Some observations: > > * [[ https://grafana.wikimedia.org/goto/_53fKoVHR?orgId=1 | magru has the highest ave...
[17:13:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:13:07] again
[17:13:32] Dec 06 17:10:18 cp5022 purged[2011572]: 2024/12/06 17:10:18 Recoverable error (code -185) while reading from kafka: ssl://kafka-main2009.codfw.wmnet:9093/2004: 1 request(s) timed out: disconnect (after 700895195ms in state UP, 1 identical>
[17:13:36] what's up with this now
[17:14:19] also on cp5031
[17:15:39] effie: I suppose there are no ongoing activities, correct?
[17:15:57] fabfur: yeah, nothing in SAL I think at least
[17:16:08] but I do see network errors for the same time
[17:16:17] for kafka-main2009
[17:17:31] fabfur: I am done :)
[17:17:35] https://grafana.wikimedia.org/goto/CXXuSJ4Hg?orgId=1 looks like it's recovering
[17:17:45] effie: thanks, was just for confirmation :D
[17:17:55] this is codfw again, eqiad was the one we were working on this week
[17:18:00] RESOLVED: [6x] PurgedHighEventLag: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:18:03] ok thanks
[17:18:04] a temporary network error then?
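For context on the "Recoverable error (code -185)" line at 17:13:32: -185 appears to be librdkafka's internal RD_KAFKA_RESP_ERR__TIMED_OUT code, a transient broker-communication error that the client retries on its own, which matches the "request(s) timed out" text and the quick recovery seen here. The sketch below is a minimal, hypothetical consumer loop using confluent-kafka-go (the broker, topic, and group names are placeholders, and purged's actual code may differ); it shows how such errors typically surface and why they rarely need operator action unless they persist.

    package main

    import (
        "log"

        "github.com/confluentinc/confluent-kafka-go/v2/kafka"
    )

    func main() {
        // Hypothetical consumer configuration; broker and group names are placeholders.
        c, err := kafka.NewConsumer(&kafka.ConfigMap{
            "bootstrap.servers": "kafka-main2009.codfw.wmnet:9093",
            "security.protocol": "ssl",
            "group.id":          "example-purged-like-consumer",
        })
        if err != nil {
            log.Fatal(err)
        }
        defer c.Close()

        if err := c.SubscribeTopics([]string{"example-purge-topic"}, nil); err != nil {
            log.Fatal(err)
        }

        for {
            switch e := c.Poll(1000).(type) {
            case *kafka.Message:
                // Process the purge event here.
                _ = e
            case kafka.Error:
                // Non-fatal errors such as ErrTimedOut (-185) are retried by
                // librdkafka internally; only fatal errors require intervention.
                if e.IsFatal() {
                    log.Fatalf("fatal kafka error: %v", e)
                }
                log.Printf("transient kafka error (code %d): %v", e.Code(), e)
            }
        }
    }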
[17:20:18] fabfur: on the kafka host, can definitely see the errors in the kafka service but not sure how to proceed beyond that
[17:21:05] that usually recovers, especially if there are no other related alerts, IMHO
[17:23:06] yeah seems to be OK again and no more purged alerts
[17:28:00] FIRING: [6x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:28:31] I suspect we have network issues at play here
[17:28:35] if you look at the puppet failures
[17:33:00] RESOLVED: [32x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
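The 11:17 task update (T380583, "Avoid logging errors per produced message") targets exactly this failure mode: during a transient broker problem, every produced message can fail and emit its own error line. Below is a minimal sketch of rate-limited error logging, assuming Go and golang.org/x/time/rate; the helper name and rate are hypothetical illustrations, not haproxykafka's actual implementation.

    package main

    import (
        "errors"
        "log"
        "time"

        "golang.org/x/time/rate"
    )

    // errLogLimiter allows roughly one error log line per 10 seconds (burst of 1),
    // so a broker blip that fails thousands of produced messages does not flood the journal.
    var errLogLimiter = rate.NewLimiter(rate.Every(10*time.Second), 1)

    // logProduceError is a hypothetical helper; the real feature referenced in
    // T380583 may be implemented differently.
    func logProduceError(err error) {
        if errLogLimiter.Allow() {
            log.Printf("kafka produce error (further errors suppressed): %v", err)
        }
    }

    func main() {
        // Simulate a burst of produce failures during a transient broker outage.
        for i := 0; i < 1000; i++ {
            logProduceError(errors.New("request timed out"))
        }
    }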