[00:10:17] (CR) DLynch: "@Ottomata where do I need to bump the version for events being submitted to this? (I'm not sure how the legacy ones fit together for that." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747205 (owner: DLynch)
[02:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[03:28:12] (PS1) Sharvaniharan: Android MEP schema for customizing toolbar [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747226
[06:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:46:24] (CR) Thiemo Kreuz (WMDE): [C: +1] "Trivial. But I can't +2 in this codebase, unfortunately." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/742234 (owner: Amire80)
[07:20:52] !log elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics
[07:20:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:53:35] (CR) Kosta Harlan: [C: +2] Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233) (owner: MewOphaswongse)
[07:54:14] (Merged) jenkins-bot: Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233) (owner: MewOphaswongse)
[08:08:25] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) For what is worth, dbstore1007 memory in the last 30 days remains stable afte...
[08:12:04] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) Nevermind, I was looking at the wrong graph. It keeps increasing, we'll see i...
[08:51:47] !log Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
[08:51:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[12:37:02] (CR) Joal: [C: +1] "This version is ready to be deployed :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[13:02:07] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) >>! In T263277#7570288, @JAllemandou wrote: > Am I right in assuming that this data has the same schema as the original `n...
[13:03:19] (PS8) Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[13:11:50] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) No need to detail the fields and schema :) About data augmentation, [[ https://github.com/wikimedia/analytics-refiner...
[13:13:50] (PS3) Joal: Update wikidata_entity table create and oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/740589 (https://phabricator.wikimedia.org/T258834)
[13:14:24] Analytics, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 5 others: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (JAllemandou)
[13:17:37] btullis: Good afternoon - Would you be ready for the AQS cluster swap either now or tomorrow?
[13:18:29] (PS3) Joal: Add structured_data.commons_entity table create [analytics/refinery] - https://gerrit.wikimedia.org/r/740590 (https://phabricator.wikimedia.org/T258834)
[13:21:37] (PS9) Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[13:27:57] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Cool, only `ip_version` and `region` are useful here.
[13:38:49] Yes I think we could do that this afternoon, with a bit of luck.
[13:39:06] That'd be awesome btullis :)
[13:39:35] btullis: as usual I'm with kids making it not easy for me to monitor - feel free to move forward with ottomata when he joins if I'm not around
[13:42:33] (PS1) Joal: Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[13:49:58] joal: Will do.
[13:54:21] (PS4) Joal: Update wikidata_entity table create and oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/740589 (https://phabricator.wikimedia.org/T258834)
[13:55:19] (PS4) Joal: Add structured_data.commons_entity table create [analytics/refinery] - https://gerrit.wikimedia.org/r/740590 (https://phabricator.wikimedia.org/T258834)
[14:13:53] what is the plan for the AQS cluster swap??
[14:16:58] I'm just checking out the requirements now. I think, add a single one from the new cluster to the aqs hash in: puppet/conftool-data/node/eqiad.yaml
[14:17:28] Merge the patch, which I think will pool the new host.
[14:17:47] Check for traffic/errors on the new host.
[14:18:03] If all clear, add the remaining hosts.
[14:18:16] If all clear, depool the old hosts.
[14:18:34] Remove the old hosts from conftool-data
[14:21:25] +1 looks good, I'd suggest to leave the first node (or a couple) for a day or a little less just to be sure
[14:21:52] for conftool the change is fine, but then you'll need to explicitly pool the node (IIRC) via confctl on puppetmaster1001
[14:22:00] (after everything is merged and puppet runs etc..)
[14:24:12] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) Would it hurt to keep the same augmentations? If the schema is the sameish (it sounds like it is), we can just apply the...
[14:25:32] elukey: OK, thanks. Will aim to pool one or two today and do the remaining steps tomorrow.
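For readers following the swap later: the explicit pooling step elukey mentions is done with confctl on puppetmaster1001. A minimal sketch, assuming an AQS node in eqiad (the hostname aqs1010 here is illustrative, not taken from the actual change):

    # once the conftool-data change is merged and puppet has run
    sudo confctl select 'name=aqs1010.eqiad.wmnet' get               # check the current weight/pooled state
    sudo confctl select 'name=aqs1010.eqiad.wmnet' set/pooled=yes    # pool the new node
    sudo confctl select 'dc=eqiad,cluster=aqs,service=aqs' get       # review the whole cluster afterwards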
[14:26:48] (CR) Ottomata: [C: +1] Add new EditAttemptStep integrations for mobile apps (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747205 (owner: DLynch)
[14:30:54] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) >>! In T263277#7572471, @Ottomata wrote: > Would it hurt to keep the same augmentations? If the schema is the sameish...
[14:34:47] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) I'd prefer to avoid scheduling another special job for this if we can. Can we make the NetflowTransform functions smart...
[14:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:37:35] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) I have the opposite view: I'd rather have another job instead of custom logic to prevent doing something :)
[14:40:16] (CR) Svantje Lilienthal: [C: +1] Sanitize additional event streams [analytics/refinery] - https://gerrit.wikimedia.org/r/747065 (https://phabricator.wikimedia.org/T297679) (owner: Awight)
[14:43:40] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) Yeahh...but then we have to manage and maintain another custom ingestion job. We're trying to reduce the number of those...
[14:51:32] ottomata: Is this merge commit from the master to the debian branch OK? https://gerrit.wikimedia.org/r/c/operations/debs/druid/+/747499
[14:52:29] If you're happy then I'll merge it and start following the instructions in README.Debian to build it on deneb.
[14:52:30] btullis: i think so? although, git-import-orig may get confused the next time it runs, i'm not sure though
[14:52:44] if it does, we can deal with that then
[14:58:24] OK, thanks. I'll merge and shout if I get into a pickle :-)
[15:10:28] k ;0
[15:15:53] Some really unhelpful behaviour from git here, for starters.
[15:15:57] https://www.irccloud.com/pastebin/jc1rxnZ0/
[15:17:35] heya team :]
[15:17:39] yoohoo
[15:17:42] ottomata: just saw your ping
[15:18:00] Hello mforns.
[15:18:05] heyyy :]
[15:18:12] yeah got something better since talking to you yesterday, but there is one bit i'm not sure about, wanted to brain bounce it with you
[15:18:19] ottomata: sure!
[15:18:21] bc?
[15:18:30] k
[15:19:30] Fixed with `git checkout debian --`
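The paste itself isn't reproduced in this log, but the likely source of the confusion is that in a packaging repository `debian` names both a branch and a directory, so a bare `git checkout debian` is ambiguous; the trailing `--` is what resolves it (an inference from context, not from the paste):

    git checkout debian --      # '--' ends the revision arguments, so 'debian' is taken as the branch
    git checkout -- debian      # whereas this form would restore the debian/ directory from the index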
[15:29:25] oh mforns, do you know...what is a good way to generate frozen-requirements.txt? as far as I can tell pip freeze just puts all the requirements in your current python env in there, not necessarily just the ones needed by your project requires
[15:31:03] hm, don't know about that
[15:37:10] ottomata: pip freeze has an --exclude flag, maybe we can use that?
[15:41:33] ottomata: this is a bit hacky, but maybe we can extract some idea: https://stackoverflow.com/questions/23640182/ignore-certain-packages-and-their-dependencies-with-pip-freeze
[16:10:05] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (BTullis) I notice that Atlas will need to contact a zookeeper cluster when it runs. Whilst it might be possible to include a version of zookeeper and run it on an-test-coord10...
[16:12:15] mforns: looking
[16:13:05] mforns: i think the answer will be: Use virtualenv and do not install unwanted packages into it
[16:13:10] (or conda env)
[16:13:20] i want an automated process to generate frozen-requirements
[16:13:24] aha
[16:13:37] but has to exclude spark and others right?
[16:13:40] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (razzi) @BTullis there is a test-zookeeper1002.eqiad.wmnet node, but it's not accessible from the analytics vlan. I think we should be able to punch a hole in it using somethin...
[16:13:49] so putting things into a hardcoded ignore just because you've pip installed them manually at some point isn't quite right
[16:13:56] mforns: that will be done with extras_requires
[16:14:06] so, just don't put pyspark in your regular install_requires
[16:14:14] it goes in the provided extras_requires
[16:14:18] aha
[16:14:40] that all works; but i want a list of frozen-requirements (with all the transitive dependencies and versions) for building the conda env
[16:15:19] can we pass the extra_required packages to pip freeze --exclude flag?
[16:15:30] mforns: kinda like https://docs.npmjs.com/cli/v8/commands/npm-install
[16:15:32] The --package-lock-only argument will only update the package-lock.json, instead of checking node_modules and downloading dependencies.
[16:15:45] mforns: thats not the problem
[16:15:56] the problem is if you have a development conda or virtualenv
[16:16:13] and at some point you do something like pip install somerandompackage
[16:16:18] then later
[16:16:19] when you run
[16:16:20] yea
[16:16:20] pip freeze
[16:16:31] somerandompackage will go in frozen-requirements.txt
[16:16:54] can the packaging process run pip install from scratch, and then pip freeze on top of that?
[16:17:03] yeah that would work
[16:17:14] or sortof, something like that
[16:17:24] but we want frozen-requirements as an input to the packaging process
[16:17:37] so its sort of a standalone step
[16:17:40] hm
[16:17:53] so yeah we could make the process to make frozen-requirements always use a clean build env
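A minimal sketch of that clean-build-env approach, for the record; the name of the extra holding pyspark is whatever workflow_utils actually defines, not something pinned down here:

    python3 -m venv /tmp/freeze-env && . /tmp/freeze-env/bin/activate
    pip install .                             # install_requires only; the pyspark/"provided" extra stays out
    pip freeze > frozen-requirements.txt      # note: the project itself also appears here and may need filtering out
    deactivate && rm -rf /tmp/freeze-env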
[16:18:19] its just a bit annoying, i'd hope that pip could generate its resolved dependencies without actually installing them
[16:18:54] I see
[16:19:07] v
[16:19:07] https://discuss.python.org/t/anyway-to-resolve-package-dependencies-without-installation-or-download/5991
[16:19:28] https://github.com/pypa/pip/issues/53
[16:19:46] https://github.com/pypa/pip/issues/7819
[16:21:01] oo, reading: https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/
[16:21:51] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis)
[16:22:00] meh not that helpful
[16:22:15] interesting but yea
[16:22:15] maybe
[16:22:16] https://pypi.org/project/johnnydep/
[16:23:10] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) p:Triage→High
[16:24:53] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (Ottomata) We should punch a hole to test-zookeeper. However! > one other option is to use the zookeeper instance that is running on an-test-druid1001. @razzi this would hel...
[16:25:13] heh, cool package name!
[16:27:12] ottomata: looks good!
[16:27:27] mforns: yeah i'll try it!
[16:27:46] would you execute it for each one of the dependencies?
[16:29:20] or do you think it'd work for workflow_utils itself via setup.cfg?
[16:30:12] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (BTullis) Oh I see, I think. So queries //other than// the icinga query should still have worked? (Although there weren't any que...
[16:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:56:00] mforns: i'd hope it would work for setup.cfg
[16:56:09] haven't tried yet
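For anyone picking this thread up later: johnnydep resolves and prints a dependency tree without installing anything, which is the property hoped for above. Whether it can be pointed straight at a local setup.cfg (rather than a released distribution) is exactly the open question here, and the output-format flag below is from memory, so double-check it against the project's README:

    pip install johnnydep
    johnnydep requests                            # example: resolved dependency tree for a PyPI package
    johnnydep requests --output-format pinned     # if supported, emits name==version lines suitable for a frozen file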
[17:00:05] Data-Engineering, Data-Engineering-Kanban, Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (BTullis) a:Mayakp.wiki→BTullis Investigating this, as the log files aren't being created as expected.
[17:01:42] ottomata: ping standuppp
[17:01:51] ty
[17:06:29] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (elukey) IIRC icinga/nagios should poke the local aqs daemon on every host, and aqs is the Cassandra client that needs to authenti...
[17:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[17:31:30] milimetric: Actually I'm facing an issue with my 'easy' stuff - you can merge/deploy the SparkSQLNoCLIDriver at will :)
[17:31:40] ok joal, will do after lunch
[17:50:03] (PS2) Joal: Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[17:55:23] (CR) jerkins-bot: [V: -1] Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834) (owner: Joal)
[18:00:38] milimetric: you were right! Full root and docker on public cloud. ottomata and I are going to give it a go this afternoon, let me know if you'd like to tag along and learn sre secrets
[18:01:27] razzi: cool, great. I can do egeria if you two are doing some of the others
[18:05:24] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) >>! In T263277#7572522, @Ottomata wrote: > The custom logic could even just be varied on the hardcoded stream / tablen...
[18:19:19] (PS1) Joal: Update refine netflow_augment transform function [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277)
[18:26:30] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster
[18:31:42] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) This is the error I am getting, I verified there are disks in the server. I also checked BIOS and it's set to auto but I do see the disks. I am n...
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:51:49] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with...
[18:53:31] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) @papaul or @robh could you look at this and let me know what I am missing.
[18:58:53] (CR) Ottomata: [C: +1] Update refine netflow_augment transform function (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277) (owner: Joal)
[18:59:20] razzi: wanna look at docker stuff sooner rather than later? i have to leave a little early today for an appointment
[18:59:58] sg ottomata I’ll hop in the bc!
[19:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:00:28] razzi: lets do here in irc for a bit and hop in if we need to? that way i can focus on multiple things at once! :)
[19:00:33] but lets see!
[19:00:43] we want to get a node with docker so you can just run docker commands, right?
[19:00:49] Haha ok yeah
[19:01:33] So I figure I can start by creating a new data-catalog node in the data-engineering horizon ui?
[19:01:55] ya should be good
[19:02:06] i'm looking at puppet to see if that stuff makes things easier or harder
[19:02:34] Is it the main puppet? imo this kind of thing could be a separate repo
[19:02:45] like a mini puppet for cloud services to run docker
[19:03:12] cloud uses the main puppet too
[19:04:32] yeah I guess that's the current approach, it's a bit off topic but let it be known I think the main puppet having everything across prod, test, cloud etc is a bit unwieldy
[19:04:56] on the other hand; you don't have to rewrite everything
[19:05:06] my biggest complaint is that people don't code modules environment agnostic
[19:05:24] if that were done well, it wouldn't matter, you'd be able to include the profiles and set the hiera all via the UI
[19:05:28] buuuut yes, side topic
[19:05:39] yeah totally, I think a best of both worlds would be the ability to import puppet modules from a single source, then have production / cloud / any environment import what it needs
[19:05:41] razzi: i think profile::docker::engine
[19:05:42] will help you
[19:05:52] (unless, it gets in the way : )
[19:05:54] and the single source should be environment agnostic
[19:06:01] yeah
[19:06:05] so is there a new horizon project, or should I use analytics?
[19:06:15] there is data engineering
[19:06:18] analytics is fine too
[19:06:19] ok I found data-engineering in the dropdown :)
[19:06:21] yeah
[19:07:25] razzi: here is an example of doing it with puppet in deployment-prep
[19:07:26] https://horizon.wikimedia.org/project/instances/94eef137-56f8-42aa-93a8-b1d8bbfef4bf/
[19:07:50] because you won't be declaring the running service containers via puppet (i assume you'll just run docker images yourself manually)
[19:07:51] for size, I'm thinking g3.cores2.ram4.disk20: 4gb ram
[19:08:14] sure, although sometimes 4gb is not enough for big java stuff like this? but yeah
[19:08:15] but 20gb disk might not be enough; the single atlas server jar is about 900m
[19:08:43] for everything more than 4gb I see: This flavor requires more RAM than your quota allows. Please select a smaller flavor or decrease the instance count.
[19:08:58] and instance count is only 1
[19:09:57] is there a way to get bigger ram? Or am I stuck with 4gb
[19:10:08] majavah: ^
[19:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:10:57] oof that sucks razzi, maybe try in analytics project then
[19:11:04] haha ok let's see
[19:11:07] there is a way but we have to ask cloud
[19:11:45] razzi: you can request more quota if you need it
[19:11:55] analytics gives me 8gb, so that's better
[19:12:00] how do I do this majavah ?
[19:12:19] https://phabricator.wikimedia.org/project/view/2880/
[19:14:28] hmmm, razzi looking more, i can't tell how much profile::docker::engine will help you
[19:14:29] cool, I'll start with 8gb, hopefully that'll be enough. If it goes oom I'll look at that process majavah, thanks!
[19:14:35] it might be worth trying https://docs.docker.com/engine/install/debian/ first
[19:14:47] profile::docker::engine doesn't do much but install a vetted docker debian package
[19:15:03] it does set a couple of grub::bootparams (?)
[19:15:16] which may or may not be necessary
[19:15:20] huh
[19:15:58] iirc the data-engineering project was created with the expectation that it will eventually replace the analytics one :/
[19:16:43] hmm, service::docker could help you, but it is hardcoded to only use the wikimedia docker registry
[19:16:45] so you can't use it
[19:16:47] so
[19:17:00] but, checkout the systemd unit it declares to run the docker container
[19:17:05] service/docker-service-shim.erb
[19:17:25] it has the docker run command i think you'll need to get the ports exposed
[19:19:30] ok cool, I see the script, looks pretty simple
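If profile::docker::engine ends up being more trouble than it's worth on a throwaway Cloud VPS instance, the plain-Debian route being discussed is short; this is a sketch of the fallback, not a statement of what was actually done (the upstream docker-ce packages from docs.docker.com are the other option):

    sudo apt-get update && sudo apt-get install -y docker.io    # Debian's own docker package on bullseye
    sudo systemctl enable --now docker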
[19:20:40] majavah: yeah we should probably clean up what currently exists in the analytics project, I'm not currently using anything that's there. Take heart that the data catalog evaluation (what I just created) is not expected to last more than a few months
[19:21:20] Maybe we can up the limits for data-engineering, and then make the analytics limit 0 meaning "don't create anything new here" :)
[19:24:59] Yeah, when I asked to create the data-engineering project there was a bit of push-back about having two umbrella projects. So I specifically said let's start with low quotas and we can increase them over time as stuff from analytics might get decommissioned.
[19:37:37] Really simple cloud vps question for ottomata or anybody else: when I do `sudo apt update && sudo apt upgrade` I get: `E: The repository 'http://security.debian.org bullseye/updates Release' does not have a Release file. N: Updating from such a repository can't be done securely, and is therefore disabled by default.`
[19:38:45] hmm, maybe no need to run upgrade?
[19:40:41] yeah I guess that's not necessary
[19:40:57] Was just following an "install docker on debian" tutorial and they started with that
[19:44:16] ok cool https://wikitech.wikimedia.org/wiki/Docker is nice and to-the-point
[19:45:41] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster
[19:49:39] ok cool docker is running! ottomata you said I could access the cloud vps host on a public domain?
[19:50:35] I did `sudo docker run -it nginx -p 80:80` and I can access it locally
[19:50:54] so I can probably do ssh forwarding at this point, but sharable urls are nice :)
[19:51:05] yes...
[19:51:28] https://horizon.wikimedia.org/project/proxy/
[19:51:47] pretty straightforward
[19:51:59] create a proxy to your instance
[19:52:30] Ok cool that worked! data-catalog-evaluation.wmcloud.org
[19:52:34] ty ottomata
[19:52:44] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with err...
[19:55:44] nice!
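One small gotcha in the `docker run` line quoted above: options that appear after the image name are passed to the container's command rather than to docker, so `-p 80:80` placed after `nginx` wouldn't actually publish the port (presumably the real invocation differed slightly). The ordering that does what was intended:

    sudo docker run -d --name nginx-test -p 80:80 nginx    # options before the image name; -d to leave it running
    sudo docker ps                                          # should show 0.0.0.0:80->80/tcp published

The Horizon web proxy then only needs to point at the instance and that port.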
[20:06:11] The docker image for atlas unfortunately fails due to the same reason mvn install wasn't working on the test cluster; expired certificate for https://maven.restlet.com
[20:06:44] (it expired last Saturday)
[20:11:55] oh rats
[20:15:33] I'll create a ticket to track that issue since it's apparently not going away by itself :)
[20:22:50] yeah
[20:22:56] mforns: yt?
[20:23:07] ottomata: yess
[20:23:09] sup
[20:23:37] bb bc real quick?
[20:24:07] omw
[20:38:40] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:42:55] Hm let me see about this alert
[20:44:14] 1-12-15T20:30:31.023 INFO ProduceCanaryEvents Succeeded producing canary events to 94 / 95 streams.
[20:44:14] 1-12-15T20:30:31.023 ERROR ProduceCanaryEvents Encountered failures when producing canary events to 1 / 95 streams.
[20:45:56] It says in the source https://gerrit.wikimedia.org/g/analytics/refinery/source/+/df129ab5813669e7ea915d135726ebdc2caa7da4/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ProduceCanaryEvents.scala `// Failures have already been logged.` but I'm not sure where; ottomata do you know?
[20:47:44] razzi: i don't know why it does that sometimes, but it seems to be intermittent. i was wondering if there might be some inconsistency in one of the running eventgate's cached stream configs or schemas
[20:49:50] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:50:41] huh ok there's the recovery
[20:59:40] (CR) Milimetric: [C: +2] "merging to deploy as part of 0.1.23" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[21:07:58] (Merged) jenkins-bot: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[21:10:47] Data-Engineering, Data-Engineering-Kanban, User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (razzi)
[21:12:59] (PS1) Milimetric: Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615
[21:13:15] (CR) Milimetric: [C: +2] Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615 (owner: Milimetric)
[21:13:20] (CR) Milimetric: [V: +2 C: +2] Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615 (owner: Milimetric)
[21:13:59] Starting build #100 for job analytics-refinery-maven-release-docker
[21:32:18] Project analytics-refinery-maven-release-docker build #100: SUCCESS in 18 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/100/
[22:28:14] Starting build #59 for job analytics-refinery-update-jars-docker
[22:28:46] (PS1) Maven-release-user: Add refinery-source jars for v0.1.23 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/747620
[22:28:47] Project analytics-refinery-update-jars-docker build #59: SUCCESS in 32 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/59/
[22:35:42] (CR) Milimetric: [V: +2 C: +2] Add refinery-source jars for v0.1.23 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/747620 (owner: Maven-release-user)
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:48:17] Analytics, Readers-Web-Backlog (Needs Prioritization (Tech)), Wikimedia-production-error: eventgate_validation_error: '.web_session_id' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T297521 (Jdlrobson) Analytics team.. any idea what could be going on here?