[02:10:37] 10Quarry: Setup tests framework - https://phabricator.wikimedia.org/T210360 (10Andrew) 05Open→03Resolved a:03Andrew [02:10:41] 10Quarry, 10cloud-services-team (FY2021/2022-Q1): Develop Quarry tests - https://phabricator.wikimedia.org/T210359 (10Andrew) [02:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [06:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [08:29:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10elukey) Very nice summary Ben :) I agree that the timeout can be increased in the Analytics cluster, but I'd suggest to do it in steps to see how the cluster... [08:32:07] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10JAllemandou) Nice catch @elukey! I'd suggest using a different timeout for answers-from-historical (smaller) and overall query (larger). The broker does some w...
[08:55:10] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.02 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [09:14:35] seems mediawiki.mediasearch_interaction [09:14:41] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=75&orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=now-24h&to=now [09:15:56] yeah makes sense, there was a deploy yesterday https://sal.toolforge.org/log/xYr_m30B8Fs0LHO53sGB [09:16:07] it matches the rise in errors [09:16:42] the error seems to be (from logstash) - '.search_result_page_id' should be integer [09:22:48] mmm if it matches with the deployment it should be something committed there [09:22:57] (for the deployment train) [09:23:04] I'll comment in the train task [09:23:31] Thanks a lot elukey <3 I wouldn't know how to response to this type of error [09:26:52] https://phabricator.wikimedia.org/T293953 [09:27:04] joal: me too, let's see what they say in the task [09:27:09] it seems deployment related [10:02:45] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (10JAllemandou) [10:04:49] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.038 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [10:05:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Investigate Superset 
Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10BTullis) Thanks both, that's a really useful set of insights. Looking into the [[https://druid.apache.org/docs/latest/configuration/index.html|configuration]],... [10:07:57] elukey: Thanks also. [10:11:38] btullis: np! I have a code review for you :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/745484 [10:15:13] elukey: Great, I was wondering about that when I saw it last night. I can offer you a trade :-) https://gerrit.wikimedia.org/r/c/operations/alerts/+/744813 [10:17:22] ahhaha ack will review in a sec [10:18:47] elukey: qq The druid timeout change will require a rolling restart with cookbook, right? i.e. These aren't dynamic configuration values. [10:18:51] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.013 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [10:18:54] wow - SRE-people trading CRs in plain IRC chan view - the world has changed ;) [10:21:15] btullis: exactly yes, but we could also limit the scope to brokers/historicals in theory [10:22:18] elukey: Got it. Thanks. [10:22:23] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.041 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [10:23:56] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Desktop Improvements, and 4 others: Sticky header: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (10ovasileva) 05Open→03Resolved Looks good, resolving.
Follow-ups will be docum... [10:24:42] joal: More than happy to make you a deal on a puppet CR too, but I still don't have membership of the Analytics group in gerrit :-) https://gerrit.wikimedia.org/r/admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members [10:25:38] No way btullis!!! this needs to be updated - maybe elukey has rights? [10:27:18] I can't :( probably releng needs to do it [10:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [10:35:22] np. I'll create a phab task for it at some point. :-) [10:41:14] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10BTullis) The loading is still proceeding without any error. Looking at the pattern of compactions, it is clear that we... 
[10:46:26] !log roll restarting druid brokers on analytics cluster [10:46:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:02:58] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.056 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [11:08:05] !log roll restarting druid historical daemons on analytics cluster T297148 [11:08:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:08:08] T297148: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 [11:24:07] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.008 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [11:31:55] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.02 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [11:56:43] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.084 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [13:39:13] PROBLEM - eventgate-analytics-external validation error rate too high on 
alert1001 is CRITICAL: 2.009 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [14:26:06] I am now boosted :) [14:27:19] joal: 🚀 [14:32:07] joal: \o/ [14:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [14:55:58] ottomata: o/ [14:56:29] I am going to reimage kafka-main2003 in 20/30 mins, will you be available to check eventgate main just in case somethings fires up? [14:59:01] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10BTullis) I think I have an understanding of why this is happening and what we shoul... [15:08:05] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10BTullis) Also, while I'm here, I think that the link to Logstash is incorrect on th... [15:14:52] elukey: I might be able to help if ottomata is busy. [15:15:01] super thanks :) [15:15:45] btullis: an interesting tool that you may not have heard of (yet) is purged, it runs on all cpXXXX nodes [15:15:48] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1 [15:16:15] it pulls from kafka main clusters, and mediawiki pushes PURGE events to kafka [15:16:42] For cache invalidating?
[15:16:42] purged reads them, and takes care of local purging [15:16:47] yeah exactly [15:18:01] Gotcha. It used to be this multicast thing, didn't it? https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging [15:18:38] yep, a bit mess basically, varnish was lagging all the time [15:18:44] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10odimitrijevic) p:05Medium→03High [15:19:09] 10Data-Engineering: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (10odimitrijevic) p:05Triage→03Medium [15:20:38] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10odimitrijevic) p:05Triage→03High [15:24:04] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Ottomata) My PR to have Data-Engineering added has been merged. @Urbanecm can we make a Herald rule to do > if tagged with Analytics-Radar remove tag Data-Engineerin... [15:24:52] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Ottomata) a:05Ottomata→03Milimetric [15:27:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Urbanecm) @Ottomata @Milimetric Created {H393} for you. Can you verify the rule is correct, and either resolve the task or request changes? [15:28:56] ottomata: actually...not 100% sure the rule's supposed to be live now.
Shout if not :D [15:30:10] 10Analytics-Radar, 10Event-Platform, 10Metrics-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Ottomata) [15:31:27] 10Analytics, 10Event-Platform, 10Metrics-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Ottomata) [15:34:16] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Ottomata) > if tagged with Analytics-Radar remove tag Data-Engineering Tested, this works! https://phabricator.wikimedia.org/herald/transcript/4545650/ > if tagged wi... [15:34:40] oh urbanecm the analytics-radar one worked [15:34:55] great! [15:35:00] the +Event Platform -> +Data Engineering one did not [15:35:06] but, maybe the bot takes a while [15:35:07] or. [15:35:15] maybe it has already been done once so it isn't doing again? [15:35:26] depends how you set it [15:35:28] let me check [15:35:44] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Metrics-Platform, 10Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (10Ottomata) [15:48:57] ottomata: yeah, it ignores tasks where the tag was added before (either by a human or the bot) to avoid reverting [15:49:04] it's because you did "once: True" in the bot config [15:49:11] okay, that's fine [15:49:15] lemme test with another then [15:49:19] sure [15:49:48] 10Analytics, 10Event-Platform: dummy test task rule T295397 - https://phabricator.wikimedia.org/T297399 (10Ottomata) [15:49:57] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Urbanecm) >>! 
In T295397#7559997, @Ottomata wrote: >> if tagged with Analytics-Radar remove tag Data-Engineering > Tested, this works! > https://phabricator.wikimedia.... [15:50:46] urbanecm: https://phabricator.wikimedia.org/T297399 Data-Engineering was not added [15:51:01] the bot's not realtime. Give it a few minutes please :-) [15:51:42] `Maintenance_bot added a project: Data-Engineering.` [15:51:44] sounds it works? [15:55:48] ottomata: https://phabricator.wikimedia.org/T297400 (as FYI) [15:56:16] unbreak now seems a little brutal [15:56:21] ah awesome urbanecm thank you! [15:56:23] I didn't mean to cause that :D [15:56:33] 10Analytics, 10Data-Engineering, 10Event-Platform: dummy test task rule T295397 - https://phabricator.wikimedia.org/T297399 (10Ottomata) 05Open→03Invalid [15:56:57] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Ottomata) Ah, it works! https://phabricator.wikimedia.org/T297399 We can resolve this task. [15:57:09] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10Ottomata) 05Open→03Resolved a:05Milimetric→03Ottomata [15:57:14] elukey: it's marked as a train blocker, and blockers UBNs by definition [15:57:20] *are UBNs [15:57:26] elukey: looking [15:59:08] urbanecm: yes yes I mean that it is probably not a train blocker, those events can fail without causing the train to be blocked etc.. [16:00:07] thanks elukey u caused [16:00:11] commented* [16:01:08] ottomata: wait you're looking at T297400? [16:01:09] T297400: '.search_result_page_id' should be integer - https://phabricator.wikimedia.org/T297400 [16:01:45] milimetric: elukey just pinged me above on it [16:01:47] not looking at it anymore :) [16:01:54] I'm about to argue that we ban slack btw, this has been a very confusing thread to follow [16:02:01] oh> [16:02:02] ?
[16:02:09] oh [16:02:12] we're talking about it in both places :P [16:02:13] i propose we ban confusing threads instead :D [16:02:17] lol [16:02:19] done [16:02:44] ok, so I commented on https://phabricator.wikimedia.org/T297400 that I'm on-call and if anyone needs help to let me know, but that the instrumentation should be fixed [16:03:33] milimetric: you shouldnt' have to do anything there, that is a problem with some instrumenation code, just finding the right person to fix would be helpful [16:04:27] yeah, that's what I'm saying :) But you know, sometimes we have bugs in our validation code, it's happened [17:02:50] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10Data-Engineering-Kanban: Wikistats Bug differing view numbers - https://phabricator.wikimedia.org/T295298 (10Milimetric) 05Open→03Resolved [17:07:52] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (10JAllemandou) QA done on 4 million points per day on 4 files, one of them being a know problemat... [17:20:15] 10Analytics, 10Event-Platform: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (10Milimetric) Another way to get ownership would be to find the lines of code in our repositories that reference the properties failing validation, the commits that generated them, and the...
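Editor's note on the validation error discussed above ('.search_result_page_id' should be integer): the sketch below is only an illustration of how an integer type check rejects an ID that instrumentation sends as a string. It is not EventGate's actual validator; the function name `check_integer_field` and both sample payloads are hypothetical, and real schema validation covers far more than this single check.

```python
def check_integer_field(event, field):
    """Return an error message if `field` is present but not an integer, else None.

    Hypothetical helper for illustration only; not part of EventGate.
    """
    if field not in event:
        return None  # absent fields are not this check's concern
    value = event[field]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return f"'.{field}' should be integer"
    return None


# Hypothetical payloads: the first sends the page ID as a string.
bad_event = {"search_result_page_id": "12345"}
good_event = {"search_result_page_id": 12345}

print(check_integer_field(bad_event, "search_result_page_id"))
print(check_integer_field(good_event, "search_result_page_id"))
```

Under these assumptions the usual fix is on the producer side, as suggested in the chat: the instrumentation code should cast the ID to an integer before sending the event.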
[17:38:44] 10Analytics, 10Event-Platform, 10Product-Analytics: Develop comprehensive process, guidelines, and roles for Event Platform stream sanitization - https://phabricator.wikimedia.org/T276955 (10Ottomata) [17:39:30] 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Data Catalog Requirements - https://phabricator.wikimedia.org/T294258 (10Ottomata) [17:41:39] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (10Jdlrobson) [18:13:11] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10Cmjohnson) [18:35:27] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [18:58:07] ottomata: I'm on the test cluster and trying to run `mvn clean -DskipTests install`, and getting network timeouts; I tried setting http_proxy but it appears to not have outbound network anyways. Is there a way to get that to work? [18:58:34] yes, i've found that for mvn you have to set some java properties to use the proxy [18:58:45] oh! 
Very good [18:59:01] set the http_proxy and https_proxy env var like usual [18:59:02] and then do [18:59:03] -Djava.net.useSystemProxies=true [18:59:22] Ok cool that worked [18:59:50] nice [19:00:33] Now I'm seeing: [19:00:33] ```Plugin org.apache.maven.plugins:maven-checkstyle-plugin:2.9.1 or one of its dependencies could not be resolved: Could not find artifact org.apache.atlas:atlas-buildtools:jar:1.0 in central (https://repo.maven.apache.org/maven2) -> [Help 1]``` [19:00:50] ee [19:02:39] razzi am googling and getting some possible things to try [19:03:15] v [19:03:15] https://community.cloudera.com/t5/Support-Questions/Atlas-build-fails/td-p/211609 [19:03:16] maybe [19:05:08] it's weird because it works locally on my machine, just not remotely on an-test machines, so something seems weird with our Archiva or the proxies [19:12:39] ottomata: doesn't it seem like it's trying to get a version of this jar that doesn't exist? https://repo.maven.apache.org/maven2/org/apache/atlas/atlas-buildtools/ [19:12:45] (but how the heck does it work on my local) [19:19:42] ew, razzi changed the pom.xml to 0.8.1, which is the version that seems available at that address, and it worked [19:19:45] this is such a mess... [19:23:15] hmm, weird milimetric that it works locally [19:23:28] i think that the build on the an-test machines will not use archiva [19:23:43] unless the pom (or in your ~/.m2/settings.xml) tells it to do so [19:23:45] but hmm [19:23:47] yes [19:23:49] i think thats right. 
[19:24:08] so maybe its something with the proxy, but that is weird [19:35:46] 10Data-Engineering, 10Airflow: Create a SparkSQL runner for cluster-mode deployment - https://phabricator.wikimedia.org/T297427 (10JAllemandou) [19:36:06] 10Data-Engineering, 10Airflow: Create a SparkSQL runner for cluster-mode deployment - https://phabricator.wikimedia.org/T297427 (10JAllemandou) a:03JAllemandou [19:36:22] (03PS1) 10Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) [19:38:41] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.03 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [19:41:03] hm, ottomata doesn't this look like all streams are getting a higher rate of validation errors as of ~21:40ish? https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=1638992404035&to=1639008430481 [19:42:35] (03CR) 10jerkins-bot: [V: 04-1] Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: 10Joal) [19:54:45] 10Analytics-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10gmodena) >>! In T296543#7554872, @Ottomata wrote: > Experimental Dockerfile that does this here: > > https://gist.github.com/ottomata/2fd842a1b3d323579dc9ebe88be724ef > > @gmodena let's sync... 
[20:02:05] (03PS2) 10Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) [20:06:31] (03CR) 10jerkins-bot: [V: 04-1] Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: 10Joal) [20:07:23] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:10:34] 10Analytics-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10Ottomata) > Did you consider unifying package management and express pip+conda deps in the same environment.yml file? I did, but then decided we could support it all! If environment.yml cont... [20:11:28] hm - I have suspicions that the haddop nodemanager error is due to this job: https://yarn.wikimedia.org/proxy/application_1637058075222_141140/ [20:11:40] if other errors occur I'll kill it [20:20:15] (03CR) 10Ottomata: "<3 and nit" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: 10Joal) [20:26:01] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:00:41] (03PS3) 10Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) [21:09:06] (03CR) 10jerkins-bot: [V: 04-1] Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: 10Joal) 
[22:03:07] 10Analytics-Radar, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10odimitrijevic) [22:03:13] 10Analytics-Radar, 10Event-Platform, 10Internet-Archive, 10The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10odimitrijevic) [22:31:00] 10Analytics, 10Data-Engineering: [Urgent] Access issues with Wikimedia Developer Accounts / Superset - https://phabricator.wikimedia.org/T297440 (10EYener) [22:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org [22:36:06] 10Analytics, 10Data-Engineering: [Urgent] Access issues with Wikimedia Developer Accounts / Superset - https://phabricator.wikimedia.org/T297440 (10razzi) Hi @Eyener, have you tried resetting your password at https://wikitech.wikimedia.org/wiki/Special:UserLogin ? If that doesn’t help, what exactly do you mea... [22:47:09] 10Analytics, 10Data-Engineering: [Urgent] Access issues with Wikimedia Developer Accounts / Superset - https://phabricator.wikimedia.org/T297440 (10EYener) 05Open→03Resolved a:03EYener Amazing! @razzi Thank you so much! I was trying at this all day and I think what the difference was that it would not re...
[22:55:37] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.088 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [23:29:58] 10Data-Engineering, 10SRE, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10nshahquinn-wmf) Data Engineering folks, this ticket needs some input from you 😊 >>! In T252227#6156179, @BBlack wrote: > Before we go all the way down that path, we should... [23:50:16] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10razzi) I attempted to run the install steps on `an-test-coord1001`, but the download requests timed out because it wasn't using the proxy: ` razzi@an-test-coord1001:~/apache-... [23:52:06] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10razzi) One more thing that may be useful @BTullis: @JAllemandou and I started writing a schema for the events that the query logger will produce at https:...