[01:04:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:09:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[01:16:13] <wikibugs>	 (03PS1) 10Milimetric: Add bjn.wikibooks to the pageview allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/746982
[01:16:29] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add bjn.wikibooks to the pageview allow list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/746982 (owner: 10Milimetric)
[02:36:00] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:36:00] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:59:35] <joal>	 Hi btullis - it seems aqs new cluster is in a weird state - loading jobs for the two big tables have failed :(
[09:00:15] <joal>	 actually, not yet failed, but are very slow (still running), with failed tasks
[13:23:16] <joal>	 btullis, elukey - I see a lot of socket-errors on AQS-new hosts - this feels weird :(
[13:24:26] <joal>	 btullis: we laso have a very high file-system usage - we were expecting a lot less - could it be related to snapshots taken not released or anything like that?
[13:24:29] <joal>	 (new hosts)
[13:48:10] <btullis>	 joal: Thanks. investigating now. I don't think that we've created any snapshots on the new hosts.
[13:50:22] <btullis>	 I need to do a rolling reboot of the whole aqs_next cluster at some point, but I will investigate this first. Also I need to do this: T297483
[13:50:22] <stashbot>	 T297483: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483
[14:12:08] <gehel>	 joal: do you need more help on that failed build?
[14:15:53] <wikibugs>	 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi)
[14:19:33] <wikibugs>	 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Tests are successful:  I tested it by configuring sflow on the non-yet-prod asw1-b12-drmrs switch: `lang=diff [edit protoc...
[14:23:15] <joal>	 gehel: no thank you, all good :)
[14:23:22] <gehel>	 cool!
[14:25:35] <btullis>	 !log btullis@aqs1011:$ sudo systemctl start cassandra-b.service 
[14:25:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:34:39] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team: Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) This is this kind of thing we need to have a way to reconcile: https://wikitech.wikimedia.org/wiki/Incident_...
[14:36:00] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:36:36] <wikibugs>	 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Events as a Source of Truth - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Perhaps a better titled would be "[[ https://martinfowler.com/articles/201701-event-driven.h...
[14:42:54] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) From wikitech: https://wikitech.wikimedia.org/wiki/Cassandra#Replicating_system_auth > Authentication and authorization...
[14:46:57] <mforns>	 ottomata: yt? I have some problems with poetry and pyarrow/numpy, did you have the same issues?
[14:47:13] <mforns>	 (with worflow_utils repo)
[14:52:38] <ottomata>	 mforns:  yes hello!
[14:52:52] <mforns>	 heya :]
[14:52:52] <ottomata>	 mforns:  i don't think so, but i didn't test extensiveily
[14:53:03] <ottomata>	 mostly that things were just installed, what's the issue?
[14:54:05] <mforns>	 when I try to poetry install it conflicts: pyarrow's only version? 0.16.0 forces a version of numpy that does not support python3.7
[14:54:22] <mforns>	 don't worry, will continue trying here
[14:54:31] <mforns>	 just wondering if you had seen it before
[14:55:39] <mforns>	 wait I pasted all the wrong versions
[14:58:18] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) I can't see any issues with the current process either, apart from the fact that possibly we didn't use the `--full` opt...
[14:59:18] <btullis>	 !log cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; on aqs1010-a
[14:59:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:32] <ottomata>	 heheh
[14:59:36] <ottomata>	 was trying to parse mforns
[14:59:47] <ottomata>	 pyarrow 6.0.1
[14:59:47] <ottomata>	 and
[14:59:53] <ottomata>	 numpy 1.21.1
[14:59:54] <ottomata>	 right?
[15:00:01] <mforns>	 ok, poetry says pyarrow 6.0.1   is not compatible with numpy 1.21.1
[15:00:02] <ottomata>	 why is numpy even in there?
[15:00:25] <mforns>	 I think it's a pyarrow dependency, no?
[15:00:29] <ottomata>	 oh is it?
[15:00:37] <btullis>	 !log btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
[15:00:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:01:15] <ottomata>	 pyarrow says
[15:01:15] <ottomata>	 numpy = ">=1.16.6"
[15:01:25] <mforns>	 yes
[15:01:28] <ottomata>	 mforns:  got some code i can try and repro?
[15:01:36] <ottomata>	 or is it when you install?
[15:01:43] <mforns>	 the poetry.lock file looks good
[15:01:47] <mforns>	 the error message also says: The current project's Python requirement (3.6.9)...
[15:01:57] <mforns>	 which is weird, where is that specified, it should be 3.7
[15:02:05] <ottomata>	 mforns: 
[15:02:06] <ottomata>	 python --version
[15:02:07] <ottomata>	 and
[15:02:10] <ottomata>	 poetry run python --version
[15:02:10] <ottomata>	 ?
[15:02:14] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) This completed successfully: Now beginning to repair the system_auth table. ` btullis@aqs1010:~$ sudo nodetool-a repair...
[15:02:42] <mforns>	 ottomata: shouldn't it be:  python3 --version?
[15:02:59] <mforns>	 Python 3.7.5
[15:02:59] <ottomata>	 mforns:  depends on your env, unlikely if using poetry venv or conda
[15:03:50] <mforns>	 https://www.irccloud.com/pastebin/CiJEV8At/
[15:05:18] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) Similarly the repair for aqs1010-b completed successfully. ` btullis@aqs1010:~$ sudo nodetool-b repair --full system_aut...
[15:06:59] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:06:59] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:07] <ottomata>	 oh interesting mforns 
[15:07:09] <ottomata>	 that means your 
[15:07:18] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:07:18] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:21] <ottomata>	 mforns:  do poetry run which python
[15:07:39] <btullis>	 ^ Oh, this alert is down to what I am doing. I will ack it now.
[15:07:47] <mforns>	  /home/mforns/.cache/pypoetry/virtualenvs/workflow-utils-1T6eci4A-py3.7/bin/python
[15:07:58] <ottomata>	 wipe that 
[15:07:59] <mforns>	 hm
[15:08:00] <ottomata>	 rm -rf  /home/mforns/.cache/pypoetry/virtualenvs/workflow-utils-1T6eci4A-py3.7
[15:08:04] <ottomata>	 something is wrong :)
[15:08:06] <ottomata>	 or maybe
[15:08:07] <ottomata>	 actually
[15:08:11] <ottomata>	 maybe you can do
[15:08:29] <ottomata>	 poetry add python=3.7.9 
[15:08:35] <ottomata>	 or is it python==3.7.9
[15:09:41] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views return
[15:09:41] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:32] <mforns>	 not working..
[15:10:41] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:49] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:11:19] <ottomata>	 mforns: ?
[15:11:30] <mforns>	 reading https://github.com/python-poetry/poetry/issues/655
[15:11:33] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:17:55] <ottomata>	 hm
[15:21:21] <mforns>	 ottomata: do you use pyenv?
[15:26:17] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) I ended up not doing the remaining 10 repairs with cumin, but manually. We started getting 500 errors shortly after carr...
[15:27:18] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) The 500 errors stopped shotly after the repair commands were issued on aqs1011, but there's still no definitive answer a...
[15:27:44] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10BTullis) p:05Triage→03High
[15:28:01] <milimetric>	 HEAR YE HEAR YE
[15:28:16] <milimetric>	 The weekly deployment train is boarding
[15:28:41] <mforns>	 heya milimetric :]
[15:28:43] <milimetric>	 hi!
[15:28:59] <milimetric>	 there's only a sanitize allow list change on there, feel free to add stuff, I can deploy after meetings
[15:29:05] <mforns>	 can I add the changes to RSVDAnomalyDetection to the train?
[15:29:17] <btullis>	 Nothing from me, thanks milimetric.
[15:29:41] <milimetric>	 mforns: I'll hold the train as long as you need.  Do you need review?
[15:30:37] <mforns>	 milimetric: the thing is tested and retested, but still needs a review, the deployment will be a no-op, because no oozie job is using that version of the RSVD code (this version is for airflow)
[15:30:47] <mforns>	 milimetric: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/707517
[15:31:02] <wikibugs>	 (03CR) 10Mforns: "This change is ready for review." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[15:31:27] <milimetric>	 mforns: but this has to get merged too, no? (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/702668)
[15:31:54] <mforns>	 milimetric: no no, that will go somewhere else!
[15:32:08] <milimetric>	 batcave?  maybe I can help, ops week is awfully quiet
[15:32:49] <mforns>	 ok
[15:35:35] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (10elukey) >>! In T297483#7569667, @BTullis wrote: > The 500 errors stopped shotly after the repair commands were issued on aqs1011,...
[15:35:44] <wikibugs>	 (03CR) 10Mforns: "This can be merged now, independently from https://gerrit.wikimedia.org/r/c/analytics/refinery/+/702668" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[15:38:56] <btullis>	 I'm planning to do a rolling reboot of the cassandra servers in the aqs_next cluster (not yet in service). Any objection to my going ahead with it?
[15:45:22] <wikibugs>	 (03PS2) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692)
[15:53:21] <btullis>	 !log rebooting aqs1010
[15:53:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:55:55] <wikibugs>	 (03CR) 10Milimetric: Simplify RSVD anomaly detection job for Airflow POC (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[15:55:56] <ottomata>	 mforns:  (sorry missed ping) no don't use pyenv
[15:59:24] <mforns>	 ok
[16:00:15] <btullis>	 !log rebooting aqs1011
[16:00:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:02:02] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Investigate Superset Druid Timeouts - https://phabricator.wikimedia.org/T297148 (10BTullis) 05Open→03Resolved
[16:51:02] <ottomata>	 mforns,  i'm having issues with poetry, mostly that it does not easily support installing shell scripts.
[16:51:44] <ottomata>	 , like setup_tools scripts
[16:51:51] <ottomata>	 i might switch back to setuptools
[16:51:58] <ottomata>	 dunno
[16:52:06] <ottomata>	 could maybe make a python based workaround...
[16:55:35] <mforns>	 ottomata: hm..
[16:57:23] <joal>	 Hi folks - would one of our ops people have a few minute to do https://phabricator.wikimedia.org/T297114 please? ottomata, btullis, razzi - The ticket has been waiting on the old analytics board for some time due to now corrected wrong info on wiki :S
[16:58:14] <ottomata>	 joal:  can do
[16:58:20] <joal>	 <3 thank you
[16:58:22] <ottomata>	 although
[16:58:25] <ottomata>	 we might need approval for that
[16:58:40] <ottomata>	 do we have a reference to the orignal access ticket?
[16:58:52] <wikibugs>	 (03CR) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[16:58:58] <wikibugs>	 (03PS3) 10Mforns: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692)
[16:59:09] <joal>	 hm, I don't know! if at least the ticket moves with information demand or something, I'll be happy :)
[16:59:17] <ottomata>	 looking
[16:59:25] <ottomata>	 they asked for kerberos here
[16:59:25] <ottomata>	 https://phabricator.wikimedia.org/T295552
[16:59:37] <ottomata>	 Looks like it was just missed
[16:59:38] <ottomata>	 ok cool can do
[17:02:38] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Requesting Kerberos Identity - https://phabricator.wikimedia.org/T297114 (10Ottomata) This was originally requested and approved in {T295552} but not completed.   I think you mean to say that your email was 'scherukuwada@wikimedia.org'.    I just created your kerb...
[17:05:11] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Create a SparkSQL runner for cluster-mode deployment - https://phabricator.wikimedia.org/T297427 (10JAllemandou)
[17:06:44] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[17:14:27] <wikibugs>	 (03Merged) 10jenkins-bot: Simplify RSVD anomaly detection job for Airflow POC [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/707517 (https://phabricator.wikimedia.org/T285692) (owner: 10Mforns)
[17:14:40] <mforns>	 thanks milimetric :]
[17:17:10] <milimetric>	 np, so then I'll start the build and deploy refinery too
[17:17:20] <joal>	 btullis: I'm interested to get a ping when you have finished to reboot the cassandra-new machines please :)
[17:18:25] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Requesting Kerberos Identity - https://phabricator.wikimedia.org/T297114 (10SCherukuwada) Indeed, I did mean wikimedia.org and not mediawiki.org. Thank you.
[17:19:07] <btullis>	 joal: Will do. Starting the 3rd machine out of 6 now.
[17:19:16] <joal>	 ack - thanks btullis :)
[17:19:18] <btullis>	 !log rebooting aqs1012
[17:19:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:22:53] <wikibugs>	 10Analytics, 10Analytics-Kanban: Test snapshot-reload from all instances using pageview-top data table - https://phabricator.wikimedia.org/T291473 (10JAllemandou)
[17:22:55] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (10JAllemandou) 05In progress→03Resolved
[17:22:57] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (10JAllemandou)
[17:24:34] <wikibugs>	 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Am I right in assuming that this data has the same schema as the original `netflow`?
[17:25:08] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10BTullis) Removed snapshots from the source servers; ` btullis@aqs1004:~$ sudo nodetool-a listsnapshots Snapshot Detail...
[17:25:44] <btullis>	 !log rebooting aqs1013
[17:25:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:50:10] <wikibugs>	 (03PS1) 10Milimetric: Update changelog.md for v0.1.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747177
[17:50:27] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update changelog.md for v0.1.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/747177 (owner: 10Milimetric)
[17:51:04] <wmf-insecte>	 Starting build #99 for job analytics-refinery-maven-release-docker
[17:51:40] <btullis>	 !log rebooting aqs1015
[17:51:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:58:40] <btullis>	 joal: That's all six members of the new AQS cluster rebooted sequentially.
[18:03:23] <joal>	 Thanks a lot btullis - will rerun loading jobs now
[18:04:36] <joal>	 !log Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
[18:04:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:05:29] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #99: 09SUCCESS in 14 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/99/
[18:08:52] <wikibugs>	 10Data-Engineering: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (10jwang)
[18:12:09] <milimetric>	 disregarding pageviews_per_article_flat SLA per above
[18:19:10] <wmf-insecte>	 Starting build #58 for job analytics-refinery-update-jars-docker
[18:19:43] <wikibugs>	 (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185
[18:19:45] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #58: 09SUCCESS in 35 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/58/
[18:26:29] <wikibugs>	 10Analytics, 10Event-Platform, 10SRE, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) Ran  ` root@puppetmaster1001:~# confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=...
[18:29:56] <wikibugs>	 10Data-Engineering: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (10Ottomata) Hm, strange In /etc/hive/conf/parquet-logging.properties:  ` # Naming style for the output file: # (The output file is placed in the system temporary directory. # %u is used to provide...
[18:33:16] <joal>	 ok I have my culprit about logging
[18:34:01] <btullis>	 joal: Great.
[18:34:50] <milimetric>	 joal: I'm curious!
[18:35:01] <wikibugs>	 (03PS7) 10Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[18:35:37] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185 (owner: 10Maven-release-user)
[18:35:47] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add refinery-source jars for v0.1.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/747185 (owner: 10Maven-release-user)
[18:36:00] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:36:02] <joal>	 milimetric: well, we setup the logging of SQL-related spark code to warning in our log4j conf, to not overwhelm users - and obviously my code needs to be in that same package as it needs to use spark internal objects
[18:36:18] <joal>	 milimetric: logging was actually working as expected
[18:36:47] <milimetric>	 it always is, and yet...
[18:37:48] <joal>	 this means we'll get errors if some show up, but we don't have info logging (no big deal) - I might send a CR for a puppet patch adding the new class in the logging config to change it
[19:02:01] <milimetric>	 !log finished deploying the weekly train as per etherpad
[19:02:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:06:55] <ottomata>	 mforns: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/1
[19:49:40] <mforns>	 heya team, I'm back
[19:53:14] <ottomata>	 o/
[19:53:21] <mforns>	 looking into the MR
[19:56:29] <mforns>	 ottomata: approved, do you want me to merge?
[19:56:53] <ottomata>	 mforns:  yes please!
[19:56:58] <mforns>	 squash?
[19:57:03] <ottomata>	 iunno?
[19:57:12] <ottomata>	 i don't have a pref
[19:57:19] <ottomata>	 probably not?
[19:57:25] <mforns>	 ok, I squashed in this case, seemed...
[19:57:27] <mforns>	 oh..
[19:57:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:57:30] <ottomata>	 okay!
[19:57:31] <ottomata>	 i dunno!
[19:57:33] <mforns>	 well, it's too late :S
[19:57:37] <ottomata>	 :)
[19:57:58] <mforns>	 this one seemed like the second commit belonged to the first one, no?
[19:58:39] <mforns>	 ok, will try to move stuff to it now.
[20:01:29] <ottomata>	 k mforns  i think i need a brain bounce about this conda stuff
[20:01:31] <ottomata>	 if you have a min
[20:01:41] <mforns>	 yesss, bc
[20:12:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[21:02:56] <ottomata>	 razzi: yt? got an annoying python / cli /shell issue
[21:08:49] <ottomata>	 AH nm, i had a stupid bug in my code
[21:08:52] <ottomata>	 wasn't a shell issue
[21:19:53] <wikibugs>	 (03PS1) 10DLynch: Add new EditAttemptStep integrations for mobile apps [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747205
[22:03:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:08:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[22:14:06] <ottomata>	 mforns:  still there? :)
[22:36:00] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:54:48] <wikibugs>	 (03PS1) 10MewOphaswongse: Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233)