[11:17:05] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10386062 (10Fabfur) 05Open→03In progress
[11:17:54] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10386064 (10Fabfur) Latest package with this feature has been released, moving this task to "In progress" and waiting some time for confirmation
[13:24:20] 06Traffic, 10Data-Engineering (Q2 2024 October 1st - December 31th): Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578#10386462 (10Fabfur) 05Open→03In progress
[15:21:16] 10Wikimedia-Apache-configuration, 06Security-Team, 07Security: https://www.mediawiki.org/.well-known/change-password redirects to HTTP - https://phabricator.wikimedia.org/T381625#10386732 (10sbassett) This is pretty low-risk as we force TLS for Wikimedia wikis. I'm going to make this public since there is n...
[15:21:17] 10Wikimedia-Apache-configuration, 06Security-Team, 07Security: https://www.mediawiki.org/.well-known/change-password redirects to HTTP - https://phabricator.wikimedia.org/T381625#10386733 (10sbassett)
[15:21:29] 06Traffic, 06DC-Ops, 10ops-esams, 10ops-magru, 06SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10386737 (10RobH) >>! In T373993#10385350, @BCornwall wrote: > Some observations: > > * [[ https://grafana.wikimedia.org/goto/_53fKoVHR?orgId=1 | magru has the highest ave...
[17:13:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:13:07] again
[17:13:32] Dec 06 17:10:18 cp5022 purged[2011572]: 2024/12/06 17:10:18 Recoverable error (code -185) while reading from kafka: ssl://kafka-main2009.codfw.wmnet:9093/2004: 1 request(s) timed out: disconnect (after 700895195ms in state UP, 1 identical>
[17:13:36] what's up with this now
[17:14:19] also on cp5031
[17:15:39] effie: I suppose there are no ongoing activities, correct?
[17:15:57] fabfur: yeah, nothing in SAL I think at least
[17:16:08] but I do see network errors for the same time
[17:16:17] for kafka-main2009
[17:17:31] fabfur: I am done :)
[17:17:35] https://grafana.wikimedia.org/goto/CXXuSJ4Hg?orgId=1 looks like it's recovering
[17:17:45] effie: thanks, was just for confirmation :D
[17:17:55] this is codfw again, eqiad was the one we were working on this week
[17:18:00] RESOLVED: [6x] PurgedHighEventLag: High event process lag with purged on cp5022:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:18:03] ok thanks
[17:18:04] a temporary network error then?
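For context on the "Recoverable error (code -185)" line at 17:13:32: -185 appears to be librdkafka's internal RD_KAFKA_RESP_ERR__TIMED_OUT code, a transient broker-communication error that the client retries on its own, which matches the "request(s) timed out" text and the quick recovery seen here. The sketch below is a minimal, hypothetical consumer loop using confluent-kafka-go (the broker, topic, and group names are placeholders, and purged's actual code may differ); it shows how such errors typically surface and why they rarely need operator action unless they persist.

    package main

    import (
        "log"

        "github.com/confluentinc/confluent-kafka-go/v2/kafka"
    )

    func main() {
        // Hypothetical consumer configuration; broker and group names are placeholders.
        c, err := kafka.NewConsumer(&kafka.ConfigMap{
            "bootstrap.servers": "kafka-main2009.codfw.wmnet:9093",
            "security.protocol": "ssl",
            "group.id":          "example-purged-like-consumer",
        })
        if err != nil {
            log.Fatal(err)
        }
        defer c.Close()

        if err := c.SubscribeTopics([]string{"example-purge-topic"}, nil); err != nil {
            log.Fatal(err)
        }

        for {
            switch e := c.Poll(1000).(type) {
            case *kafka.Message:
                // Process the purge event here.
                _ = e
            case kafka.Error:
                // Non-fatal errors such as ErrTimedOut (-185) are retried by
                // librdkafka internally; only fatal errors require intervention.
                if e.IsFatal() {
                    log.Fatalf("fatal kafka error: %v", e)
                }
                log.Printf("transient kafka error (code %d): %v", e.Code(), e)
            }
        }
    }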
[17:20:18] fabfur: on the kafka host, can definitely see the errors in the kafka service but not sure how to proceed beyond that
[17:21:05] that usually recovers, especially if there are no other related alerts, IMHO
[17:23:06] yeah seems to be OK again and no more purged alerts
[17:28:00] FIRING: [6x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:28:31] I suspect we have network issues at play here
[17:28:35] if you look at the puppet failures
[17:33:00] RESOLVED: [32x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
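The 11:17 task update (T380583, "Avoid logging errors per produced message") targets exactly this failure mode: during a transient broker problem, every produced message can fail and emit its own error line. Below is a minimal sketch of rate-limited error logging, assuming Go and golang.org/x/time/rate; the helper name and rate are hypothetical illustrations, not haproxykafka's actual implementation.

    package main

    import (
        "errors"
        "log"
        "time"

        "golang.org/x/time/rate"
    )

    // errLogLimiter allows roughly one error log line per 10 seconds (burst of 1),
    // so a broker blip that fails thousands of produced messages does not flood the journal.
    var errLogLimiter = rate.NewLimiter(rate.Every(10*time.Second), 1)

    // logProduceError is a hypothetical helper; the real feature referenced in
    // T380583 may be implemented differently.
    func logProduceError(err error) {
        if errLogLimiter.Allow() {
            log.Printf("kafka produce error (further errors suppressed): %v", err)
        }
    }

    func main() {
        // Simulate a burst of produce failures during a transient broker outage.
        for i := 0; i < 1000; i++ {
            logProduceError(errors.New("request timed out"))
        }
    }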