[00:54:17] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:56:35] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:04] (PS4) Snwachukwu: Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028)
[06:04:24] (PS5) Snwachukwu: Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028)
[08:46:38] (CR) David Caro: [C: +1] "LGTM 👍" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/176506 (https://phabricator.wikimedia.org/T76084) (owner: Rtnpro)
[09:08:16] Data-Engineering: Check home/HDFS leftovers of jdl - https://phabricator.wikimedia.org/T306412 (SCherukuwada) Just checked, these are safe to delete. Thank you for checking with me.
[09:56:18] Data-Engineering, Data-Catalog, Epic: Data Catalog POC - https://phabricator.wikimedia.org/T293647 (BTullis) Should we merge this into {T299910} as a duplicate?
[10:15:15] Data-Engineering, Data-Engineering-Kanban, DC-Ops, Infrastructure-Foundations: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye - https://phabricator.wikimedia.org/T306148 (ayounsi) Resolved→Open If the device is no more in a failed state please updat...
[11:37:53] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (BTullis) A superset contributor has responded to the bug, identifying the specific component in which it occurs and suggesting a worka...
[11:40:22] Data-Engineering, Data-Engineering-Kanban: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (BTullis) Open→Resolved
[12:43:51] (CR) Aqu: [C: +1] "Looks good." [analytics/refinery] - https://gerrit.wikimedia.org/r/786448 (https://phabricator.wikimedia.org/T303988) (owner: Joal)
[12:46:44] (CR) Aqu: [C: +1] Add structured_data.commons_entity to purge [analytics/refinery] - https://gerrit.wikimedia.org/r/786452 (owner: Joal)
[13:34:01] (CR) Mforns: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[13:52:08] (PS2) Mforns: Remove the GettingStarted* allowlist entries [analytics/refinery] - https://gerrit.wikimedia.org/r/786291 (https://phabricator.wikimedia.org/T306879) (owner: Phuedx)
[13:52:27] (CR) Mforns: [V: +2 C: +2] "LGTM! Thanks" [analytics/refinery] - https://gerrit.wikimedia.org/r/786291 (https://phabricator.wikimedia.org/T306879) (owner: Phuedx)
[13:53:02] (PS2) Mforns: Remove UploadWizard* allowlist entries [analytics/refinery] - https://gerrit.wikimedia.org/r/777808 (https://phabricator.wikimedia.org/T305238) (owner: Phuedx)
[13:53:18] (CR) Mforns: [V: +2 C: +2] "LGTM! Thank you!" [analytics/refinery] - https://gerrit.wikimedia.org/r/777808 (https://phabricator.wikimedia.org/T305238) (owner: Phuedx)
[14:03:11] Data-Engineering, Data-Engineering-Kanban, Cassandra: Update HiveToCassandra job to read cassandra password from file - https://phabricator.wikimedia.org/T306895 (NOkafor-WMF) a:NOkafor-WMF
[14:12:18] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I am now investigating by capturing network traffic from the eventgate-analytics-external pods and looking...
[14:40:48] (CR) Mforns: "Hi, sorry for not responding to this change for so long, I thought it wasn't on me to respond, but it was. Left a couple comments." [analytics/refinery] - https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[15:23:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:28:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:41:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:46:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:24:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:49:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:08:20] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Well, this is a bit confusing. I've examined packet captures from two pods in eqiad and another in codfw....
[17:35:46] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have a few errors logged by ats-be attempting to connect to `eventgate-analytics-external.discovery.wmne...
[17:51:35] Data-Engineering, Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Htriedman)
[17:54:12] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) > perhaps this is a client browser opening a connection but sending an empty POST body This seems likely,...