[00:30:22] Data-Engineering, Product-Analytics, wmfdata-python: Rerunning Spark functions with changed settings has no effect - https://phabricator.wikimedia.org/T273210 (nshahquinn-wmf)
[01:46:44] PROBLEM - Host an-worker1098.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:46:52] PROBLEM - Host analytics1073.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:02] PROBLEM - Host an-worker1087.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:51:12] PROBLEM - Host an-worker1130.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:04:30] Analytics-Jupyter, Data-Engineering, Product-Analytics: Replace anaconda-wmf with smaller, non-stacked Conda environments - https://phabricator.wikimedia.org/T302819 (Ottomata) Sounds fine to me! I don't have much of an opinion, so if @aqu and @xcollazo are good with that it should be fine! Sounds...
[02:30:55] RECOVERY - Host an-worker1130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[04:51:43] PROBLEM - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:53:32] Data-Engineering, Event-Platform Value Stream (Sprint 02), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (tstarling) I think `is_temp` is fine. I don't think it's future-proof to add `user_type` since if we de...
[07:05:38] Data-Engineering, SRE, ops-eqiad: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey) Same thing this morning: ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "an-worker1086.mgmt.eqiad.wmnet" -U root -E chassis power status Unable to read password from environment...
[08:12:32] FYI; I'm switching the dse-k8s-etcd1001 VM temporarily to DRBD (to allow migration to a different node), latencies may go up temporarily
[08:50:22] moritzm: Thanks for the heads-up. 👍
[08:54:45] ACKNOWLEDGEMENT - DNS on an-worker1130.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.0.156 ayounsi https://phabricator.wikimedia.org/T320598 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:02] and switched back to "plain" after migration is completed
[09:09:03] Ack, thanks. Is it just that one VM or will you be migrating the other two VMs as well?
[09:51:03] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Hi, I'll be your SRE support for today, and will handle de/repooling, destroying th...
[10:13:20] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Destroy/apply done in staging: ` # helmfile -e staging status helmfile.yaml: basePa...
[10:30:50] Analytics, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Iluvatar)
[11:16:00] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (JArguello-WMF) @Clement_Goubert Thank you so much! Please let us know if there is anything we need...
[11:20:56] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) `eventstream` redeployed in codfw.
@JArguello-WMF Apart from checking everything i...
[11:52:21] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) `eventstream` redeployed in eqiad
[11:59:55] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Clement_Goubert) Everything looks healthy from my end, both are getting traffic and not throwing err...
[12:37:57] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Ottomata) Interesting! Adding some tags and folks. Would like to look into this. Indeed, if it happens after 15 or 30 minutes, something weird must be...
[12:40:44] Data-Engineering, SRE, serviceops, Event-Platform Value Stream (Sprint 02), Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) > eventstreams-internal is still used? I am not sure! I'd imagine folks use it, as it is...
[12:52:35] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Ottomata) (BTW, thank you for the very good bug report @Iluvatar :) )
[13:10:21] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Iluvatar) I tried to hardcode `EventSource(url, last_id = None)`, but without effect. I use streams since 2017, but this bug appeared only 6 days ago. Almo...
[13:32:56] Data-Engineering, Product-Analytics, wmfdata-python, Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (Antoine_Quhen) linked with https://phabricator.wikimedia.org/T273210
[13:54:48] RECOVERY - Host an-worker1098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.53 ms
[13:56:44] RECOVERY - Host an-worker1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[13:59:26] RECOVERY - Host an-worker1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[13:59:47] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Iluvatar) JavaScript (Browser): https://swviewer.toolforge.org/test.html (see console): Simple code from Wikitech example: `