[01:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:09:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:16:13] (03PS1) 10Milimetric: Add bjn.wikibooks to the pageview allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/746982
[01:16:29] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add bjn.wikibooks to the pageview allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/746982 (owner: 10Milimetric)
[02:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:59:35] Hi btullis - it seems the new AQS cluster is in a weird state - loading jobs for the two big tables have failed :(
[09:00:15] actually, not yet failed, but they are very slow (still running), with failed tasks
[13:23:16] btullis, elukey - I see a lot of socket errors on the AQS-new hosts - this feels weird :(
[13:24:26] btullis: we also have very high file-system usage - we were expecting a lot less - could it be related to snapshots taken but not released, or anything like that?
[13:24:29] (new hosts)
[13:48:10] joal: Thanks. Investigating now. I don't think that we've created any snapshots on the new hosts.
[13:50:22] I need to do a rolling reboot of the whole aqs_next cluster at some point, but I will investigate this first. Also I need to do this: T297483
[13:50:22] T297483: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483
[14:12:08] joal: do you need more help on that failed build?
[14:15:53] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi)
[14:19:33] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Tests are successful: I tested it by configuring sflow on the non-yet-prod asw1-b12-drmrs switch: `lang=diff [edit protoc...
[14:23:15] gehel: no thank you, all good :)
[14:23:22] cool!
[14:25:35] !log btullis@aqs1011:$ sudo systemctl start cassandra-b.service
[14:25:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:34:39] 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team: Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) This is the kind of thing we need to have a way to reconcile: https://wikitech.wikimedia.org/wiki/Incident_...
[14:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:36:36] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Events as a Source of Truth - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Perhaps a better title would be "[[ https://martinfowler.com/articles/201701-event-driven.h...
[14:42:54] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) From wikitech: https://wikitech.wikimedia.org/wiki/Cassandra#Replicating_system_auth > Authentication and authorization...
[14:46:57] ottomata: yt? I have some problems with poetry and pyarrow/numpy, did you have the same issues?
[14:47:13] (with the workflow_utils repo)
[14:52:38] mforns: yes hello!
[14:52:52] heya :]
[14:52:52] mforns: i don't think so, but i didn't test extensively
[14:53:03] mostly that things were just installed, what's the issue?
[14:54:05] when I try to poetry install it conflicts: pyarrow's only version? 0.16.0 forces a version of numpy that does not support python3.7
[14:54:22] don't worry, will continue trying here
[14:54:31] just wondering if you had seen it before
[14:55:39] wait I pasted all the wrong versions
[14:58:18] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) I can't see any issues with the current process either, apart from the fact that possibly we didn't use the `--full` opt...
[14:59:18] !log cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; on aqs1010-a
[14:59:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:32] heheh
[14:59:36] was trying to parse mforns
[14:59:47] pyarrow 6.0.1
[14:59:47] and
[14:59:53] numpy 1.21.1
[14:59:54] right?
[15:00:01] ok, poetry says pyarrow 6.0.1 is not compatible with numpy 1.21.1
[15:00:02] why is numpy even in there?
[15:00:25] I think it's a pyarrow dependency, no?
[15:00:29] oh is it?
[15:00:37] !log btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
[15:00:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:01:15] pyarrow says
[15:01:15] numpy = ">=1.16.6"
[15:01:25] yes
[15:01:28] mforns: got some code i can try and repro?
[15:01:36] or is it when you install?
[15:01:43] the poetry.lock file looks good
[15:01:47] the error message also says: The current project's Python requirement (3.6.9)...
[15:01:57] which is weird, where is that specified, it should be 3.7
[15:02:05] mforns:
[15:02:06] python --version
[15:02:07] and
[15:02:10] poetry run python --version
[15:02:10] ?
[15:02:14] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) This completed successfully: Now beginning to repair the system_auth table. ` btullis@aqs1010:~$ sudo nodetool-a repair...
[15:02:42] ottomata: shouldn't it be: python3 --version?
[15:02:59] Python 3.7.5
[15:02:59] mforns: depends on your env, unlikely if using poetry venv or conda
[15:03:50] https://www.irccloud.com/pastebin/CiJEV8At/
[15:05:18] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) Similarly the repair for aqs1010-b completed successfully. ` btullis@aqs1010:~$ sudo nodetool-b repair --full system_aut...
[15:06:59] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:06:59] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:07] oh interesting mforns
[15:07:09] that means your
[15:07:18] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:07:18] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:21] mforns: do poetry run which python
[15:07:39] ^ Oh, this alert is down to what I am doing. I will ack it now.
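For context, the system_auth work being logged and alerted on above amounts to the following sequence. This is a hedged sketch rather than the exact session: the `nodetool-a`/`nodetool-b` wrappers and the replication factor of 12 come from the log itself, while the loop over instances is an assumption.

```bash
# Sketch of the system_auth replication fix tracked in T297483 (illustrative only).

# 1. Raise the replication factor of the system_auth keyspace (run once, from any node).
#    '12' matches the number of Cassandra instances in the aqs_next cluster.
cqlsh -e "ALTER KEYSPACE system_auth WITH REPLICATION = \
  {'class': 'SimpleStrategy', 'replication_factor': '12'};"

# 2. Run a full anti-entropy repair of the keyspace on each local instance so the new
#    replicas actually get populated (this is the work that briefly caused the 500s above).
for inst in a b; do
  sudo "nodetool-${inst}" repair --full system_auth
done
```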
[15:07:47] /home/mforns/.cache/pypoetry/virtualenvs/workflow-utils-1T6eci4A-py3.7/bin/python
[15:07:58] wipe that
[15:07:59] hm
[15:08:00] rm -rf /home/mforns/.cache/pypoetry/virtualenvs/workflow-utils-1T6eci4A-py3.7
[15:08:04] something is wrong :)
[15:08:06] or maybe
[15:08:07] actually
[15:08:11] maybe you can do
[15:08:29] poetry add python=3.7.9
[15:08:35] or is it python==3.7.9
[15:09:41] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:09:41] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:32] not working..
[15:10:41] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:49] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:11:19] mforns: ?
[15:11:30] reading https://github.com/python-poetry/poetry/issues/655
[15:11:33] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:17:55] hm
[15:21:21] ottomata: do you use pyenv?
[15:26:17] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) I ended up not doing the remaining 10 repairs with cumin, but manually. We started getting 500 errors shortly after carr...
[15:27:18] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) The 500 errors stopped shortly after the repair commands were issued on aqs1011, but there's still no definitive answer a...
[15:27:44] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) p:05Triage→03High
[15:28:01] HEAR YE HEAR YE
[15:28:16] The weekly deployment train is boarding
[15:28:41] heya milimetric :]
[15:28:43] hi!
[15:28:59] there's only a sanitize allow list change on there, feel free to add stuff, I can deploy after meetings
[15:29:05] can I add the changes to RSVDAnomalyDetection to the train?
[15:29:17] Nothing from me, thanks milimetric.
[15:29:41] mforns: I'll hold the train as long as you need. Do you need review?
[15:30:37] milimetric: the thing is tested and retested, but still needs a review, the deployment will be a no-op, because no oozie job is using that version of the RSVD code (this version is for airflow)
[15:30:47] milimetric: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/707517
[15:31:02] (03CR) 10Mforns: "This change is ready for review." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
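The error mforns hits above usually means Poetry is resolving against a stale 3.6-based virtualenv in its cache rather than a genuine pyarrow/numpy conflict; the fix ottomata is circling around is to point Poetry explicitly at a 3.7 interpreter. A minimal sketch, assuming python3.7 is available on the host and the project's pyproject.toml already requires ^3.7:

```bash
# Hedged sketch: recreate the Poetry virtualenv against the intended interpreter.
poetry env list --full-path     # locate the cached venv(s) under ~/.cache/pypoetry/virtualenvs
poetry env use python3.7        # create/select a venv backed by Python 3.7 (assumes python3.7 on PATH)
poetry run python --version     # sanity check: should now report Python 3.7.x
poetry install                  # re-resolve; pyarrow/numpy constraints are now solved against 3.7
```

Deleting the cached venv directory by hand, as ottomata suggests with `rm -rf`, achieves the same thing; `poetry env use` just does it without touching the cache manually.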
[15:31:27] mforns: but this has to get merged too, no? (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/702668)
[15:31:54] milimetric: no no, that will go somewhere else!
[15:32:08] batcave? maybe I can help, ops week is awfully quiet
[15:32:49] ok
[15:35:35] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10elukey) >>! In T297483#7569667, @BTullis wrote: > The 500 errors stopped shortly after the repair commands were issued on aqs1011,...
[15:35:44] (03CR) 10Mforns: "This can be merged now, independently from https://gerrit.wikimedia.org/r/c/analytics/refinery/+/702668" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[15:38:56] I'm planning to do a rolling reboot of the cassandra servers in the aqs_next cluster (not yet in service). Any objection to my going ahead with it?
[15:45:22] (03PS2) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692)
[15:53:21] !log rebooting aqs1010
[15:53:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:55:55] (03CR) 10Milimetric: Simplify RSVD anomaly detection job for Airflow POC (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[15:55:56] mforns: (sorry missed ping) no don't use pyenv
[15:59:24] ok
[16:00:15] !log rebooting aqs1011
[16:00:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:02:02] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10BTullis) 05Open→03Resolved
[16:51:02] mforns, i'm having issues with poetry, mostly that it does not easily support installing shell scripts.
[16:51:44] , like setup_tools scripts
[16:51:51] i might switch back to setuptools
[16:51:58] dunno
[16:52:06] could maybe make a python based workaround...
[16:55:35] ottomata: hm..
[16:57:23] Hi folks - would one of our ops people have a few minutes to do https://phabricator.wikimedia.org/T297114 please? ottomata, btullis, razzi - The ticket has been waiting on the old analytics board for some time due to now corrected wrong info on wiki :S
[16:58:14] joal: can do
[16:58:20] <3 thank you
[16:58:22] although
[16:58:25] we might need approval for that
[16:58:40] do we have a reference to the original access ticket?
[16:58:52] (03CR) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[16:58:58] (03PS3) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692)
[16:59:09] hm, I don't know! if at least the ticket moves with information demand or something, I'll be happy :)
[16:59:17] looking
[16:59:25] they asked for kerberos here
[16:59:25] https://phabricator.wikimedia.org/T295552
[16:59:37] Looks like it was just missed
[16:59:38] ok cool can do
[17:02:38] 10Data-Engineering, 10Data-Engineering-Kanban: Requesting Kerberos Identity - https://phabricator.wikimedia.org/T297114 (10Ottomata) This was originally requested and approved in {T295552} but not completed. I think you mean to say that your email was 'scherukuwada@wikimedia.org'. I just created your kerb...
[17:05:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Create a SparkSQL runner for cluster-mode deployment - https://phabricator.wikimedia.org/T297427 (10JAllemandou)
[17:06:44] (03CR) 10Milimetric: [C: 03+2] Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[17:14:27] (03Merged) 10jenkins-bot: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[17:14:40] thanks milimetric :]
[17:17:10] np, so then I'll start the build and deploy refinery too
[17:17:20] btullis: I'm interested to get a ping when you have finished rebooting the cassandra-new machines please :)
[17:18:25] 10Data-Engineering, 10Data-Engineering-Kanban: Requesting Kerberos Identity - https://phabricator.wikimedia.org/T297114 (10SCherukuwada) Indeed, I did mean wikimedia.org and not mediawiki.org. Thank you.
[17:19:07] joal: Will do. Starting the 3rd machine out of 6 now.
[17:19:16] ack - thanks btullis :)
[17:19:18] !log rebooting aqs1012
[17:19:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:22:53] 10Analytics, 10Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (10JAllemandou)
[17:22:55] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (10JAllemandou) 05In progress→03Resolved
[17:22:57] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (10JAllemandou)
[17:24:34] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Am I right in assuming that this data has the same schema as the original `netflow`?
[17:25:08] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10BTullis) Removed snapshots from the source servers; ` btullis@aqs1004:~$ sudo nodetool-a listsnapshots Snapshot Detail...
[17:25:44] !log rebooting aqs1013
[17:25:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:50:10] (03PS1) 10Milimetric: Update changelog.md for v0.1.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747177
[17:50:27] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update changelog.md for v0.1.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747177 (owner: 10Milimetric)
[17:51:04] Starting build #99 for job analytics-refinery-maven-release-docker
[17:51:40] !log rebooting aqs1015
[17:51:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:58:40] joal: That's all six members of the new AQS cluster rebooted sequentially.
[18:03:23] Thanks a lot btullis - will rerun loading jobs now
[18:04:36] !log Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
[18:04:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:05:29] Project analytics-refinery-maven-release-docker build #99: 09SUCCESS in 14 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/99/
[18:08:52] 10Data-Engineering: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (10jwang)
[18:12:09] disregarding pageviews_per_article_flat SLA per above
[18:19:10] Starting build #58 for job analytics-refinery-update-jars-docker
[18:19:43] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185
[18:19:45] Project analytics-refinery-update-jars-docker build #58: 09SUCCESS in 35 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/58/
[18:26:29] 10Analytics, 10Event-Platform, 10SRE, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) Ran ` root@puppetmaster1001:~# confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=...
[18:29:56] 10Data-Engineering: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (10Ottomata) Hm, strange In /etc/hive/conf/parquet-logging.properties: ` # Naming style for the output file: # (The output file is placed in the system temporary directory. # %u is used to provide...
[18:33:16] ok I have my culprit about logging
[18:34:01] joal: Great.
[18:34:50] joal: I'm curious!
[18:35:01] (03PS7) 10Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[18:35:37] (03CR) 10Milimetric: [C: 03+2] Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185 (owner: 10Maven-release-user)
[18:35:47] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185 (owner: 10Maven-release-user)
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
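The sequential reboots logged above follow the usual one-host-at-a-time pattern: reboot a node, wait for both of its Cassandra instances to respond again, then move on. A rough sketch under those assumptions; the host list, domain, and readiness check are illustrative, not the actual commands or cookbook btullis used:

```bash
# Illustrative only: rolling reboot of the aqs_next hosts, one at a time.
for host in aqs1010 aqs1011 aqs1012 aqs1013 aqs1014 aqs1015; do
  ssh "${host}.eqiad.wmnet" 'sudo reboot' || true
  sleep 180                                  # give the host time to go down and come back
  # Wait until both local Cassandra instances (a and b) respond to nodetool again.
  until ssh "${host}.eqiad.wmnet" \
      'sudo nodetool-a status >/dev/null && sudo nodetool-b status >/dev/null'; do
    sleep 30
  done
done
```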
[18:36:02] milimetric: well, we set up the logging of SQL-related spark code to warning in our log4j conf, to not overwhelm users - and obviously my code needs to be in that same package as it needs to use spark internal objects
[18:36:18] milimetric: logging was actually working as expected
[18:36:47] it always is, and yet...
[18:37:48] this means we'll get errors if some show up, but we don't have info logging (no big deal) - I might send a CR for a puppet patch adding the new class in the logging config to change it
[19:02:01] !log finished deploying the weekly train as per etherpad
[19:02:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:06:55] mforns: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/1
[19:49:40] heya team, I'm back
[19:53:14] o/
[19:53:21] looking into the MR
[19:56:29] ottomata: approved, do you want me to merge?
[19:56:53] mforns: yes please!
[19:56:58] squash?
[19:57:03] iunno?
[19:57:12] i don't have a pref
[19:57:19] probably not?
[19:57:25] ok, I squashed in this case, seemed...
[19:57:27] oh..
[19:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:57:30] okay!
[19:57:31] i dunno!
[19:57:33] well, it's too late :S
[19:57:37] :)
[19:57:58] this one seemed like the second commit belonged to the first one, no?
[19:58:39] ok, will try to move stuff to it now.
[20:01:29] k mforns i think i need a brain bounce about this conda stuff
[20:01:31] if you have a min
[20:01:41] yesss, bc
[20:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[21:02:56] razzi: yt? got an annoying python / cli / shell issue
[21:08:49] AH nm, i had a stupid bug in my code
[21:08:52] wasn't a shell issue
[21:19:53] (03PS1) 10DLynch: Add new EditAttemptStep integrations for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205
[22:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:14:06] mforns: still there? :)
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
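On the logging culprit joal describes above (Spark SQL internals pinned to WARN in the log4j config, so a job class that has to live in that package loses its INFO output), the puppet CR he mentions would amount to an override along these lines. The config path and logger name below are assumptions for illustration, not the actual patch:

```bash
# Hypothetical illustration only -- logger name and config path are guesses.
# The existing config keeps Spark SQL internals quiet, e.g.:
#   log4j.logger.org.apache.spark.sql=WARN
# The patch would add an INFO-level rule for the one class that lives in that package:
echo 'log4j.logger.org.apache.spark.sql.SparkSQLNoCLIDriver=INFO' \
  | sudo tee -a /etc/spark2/conf/log4j.properties
```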
[22:54:48] (03PS1) 10MewOphaswongse: Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233)