[00:10:17] (CR) DLynch: "@Ottomata where do I need to bump the version for events being submitted to this? (I'm not sure how the legacy ones fit together for that." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747205 (owner: DLynch)
[02:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[03:28:12] (PS1) Sharvaniharan: Android MEP schema for customizing toolbar [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747226
[06:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:46:24] (CR) Thiemo Kreuz (WMDE): [C: +1] "Trivial. But I can't +2 in this codebase, unfortunately." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/742234 (owner: Amire80)
[07:20:52] !log elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics
[07:20:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:53:35] (CR) Kosta Harlan: [C: +2] Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233) (owner: MewOphaswongse)
[07:54:14] (Merged) jenkins-bot: Add suggestion-skip to referer_route enum for analytics/legacy/homepagevisit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747212 (https://phabricator.wikimedia.org/T297233) (owner: MewOphaswongse)
[08:08:25] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) For what is worth, dbstore1007 memory in the last 30 days remains stable afte...
[08:12:04] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) Nevermind, I was looking at the wrong graph. It keeps increasing, we'll see i...
[08:51:47] !log Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
[08:51:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[12:37:02] (CR) Joal: [C: +1] "This version is ready to be deployed :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[13:02:07] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) >>! In T263277#7570288, @JAllemandou wrote: > Am I right in assuming that this data has the same schema as the original `n...
[13:03:19] (PS8) Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[13:11:50] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) No need to detail the fields and schema :) About data augmentation, [[ https://github.com/wikimedia/analytics-refiner...
[13:13:50] (PS3) Joal: Update wikidata_entity table create and oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/740589 (https://phabricator.wikimedia.org/T258834)
[13:14:24] Analytics, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 5 others: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (JAllemandou)
[13:17:37] btullis: Good afternoon - Would you be ready for the AQS cluster swap either now or tomorrow?
[13:18:29] (PS3) Joal: Add structured_data.commons_entity table create [analytics/refinery] - https://gerrit.wikimedia.org/r/740590 (https://phabricator.wikimedia.org/T258834)
[13:21:37] (PS9) Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[13:27:57] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Cool, only `ip_version` and `region` are useful here.
[13:38:49] Yes I think we could do that this afternoon, with a bit of luck.
[13:39:06] That'd be awesome btullis :)
[13:39:35] btullis: as usual I'm with kids making it not easy for me to monitor - feel free to move forward with ottomata when he joins if I'm not around
[13:42:33] (PS1) Joal: Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[13:49:58] joal: Will do.
[13:54:21] (PS4) Joal: Update wikidata_entity table create and oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/740589 (https://phabricator.wikimedia.org/T258834)
[13:55:19] (PS4) Joal: Add structured_data.commons_entity table create [analytics/refinery] - https://gerrit.wikimedia.org/r/740590 (https://phabricator.wikimedia.org/T258834)
[14:13:53] what is the plan for the AQS cluster swap??
[14:16:58] I'm just checking out the requirements now. I think, add a single one from the new cluster to the aqs hash in: puppet/conftool-data/node/eqiad.yaml
[14:17:28] Merge the patch, which I think will pool the new host.
[14:17:47] Check for traffic/errors on the new host.
[14:18:03] If all clear, add the remaining hosts.
[14:18:16] If all clear, depool the old hosts.
[14:18:34] Remove the old hosts from conftool-data
[14:21:25] +1 looks good, I'd suggest to leave the first node (or a couple) for a day or a little less just to be sure
[14:21:52] for conftool the change is fine, but then you'll need to explicitly pool the node (IIRC) via confctl on puppetmaster1001
[14:22:00] (after everything is merged and puppet runs etc..)
[14:24:12] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) Would it hurt to keep the same augmentations? If the schema is the sameish (it sounds like it is), we can just apply the...
[14:25:32] elukey: OK, thanks. Will aim to pool one or two today and do the remaining steps tomorrow.
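For readers following the swap later: the explicit pooling step elukey mentions is done with confctl on puppetmaster1001. A minimal sketch, assuming an AQS node in eqiad (the hostname aqs1010 here is illustrative, not taken from the actual change):

    # once the conftool-data change is merged and puppet has run
    sudo confctl select 'name=aqs1010.eqiad.wmnet' get               # check the current weight/pooled state
    sudo confctl select 'name=aqs1010.eqiad.wmnet' set/pooled=yes    # pool the new node
    sudo confctl select 'dc=eqiad,cluster=aqs,service=aqs' get       # review the whole cluster afterwards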
[14:26:48] (CR) Ottomata: [C: +1] Add new EditAttemptStep integrations for mobile apps (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/747205 (owner: DLynch)
[14:30:54] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) >>! In T263277#7572471, @Ottomata wrote: > Would it hurt to keep the same augmentations? If the schema is the sameish...
[14:34:47] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) I'd prefer to avoid scheduling another special job for this if we can. Can we make the NetflowTransform functions smart...
[14:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:37:35] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) I have the opposite view: I'd rather have another job instead of custom logic to prevent doing something :)
[14:40:16] (CR) Svantje Lilienthal: [C: +1] Sanitize additional event streams [analytics/refinery] - https://gerrit.wikimedia.org/r/747065 (https://phabricator.wikimedia.org/T297679) (owner: Awight)
[14:43:40] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) Yeahh...but then we have to manage and maintain another custom ingestion job. We're trying to reduce the number of those...
[14:51:32] ottomata: Is this merge commit from the master to the debian branch OK? https://gerrit.wikimedia.org/r/c/operations/debs/druid/+/747499
[14:52:29] If you're happy then I'll merge it and start following the instructions in README.Debian to build it on deneb.
[14:52:30] btullis: i think so? although, git-import-orig may get confused the next time it runs, i'm not sure though
[14:52:44] if it does, we can deal with that then
[14:58:24] OK, thanks. I'll merge and shout if I get into a pickle :-)
[15:10:28] k ;0
[15:15:53] Some really unhelpful behaviour from git here, for starters.
[15:15:57] https://www.irccloud.com/pastebin/jc1rxnZ0/
[15:17:35] heya team :]
[15:17:39] yoohoo
[15:17:42] ottomata: just saw your ping
[15:18:00] Hello mforns.
[15:18:05] heyyy :]
[15:18:12] yeah got something better since talking to you yesterday, but there is one bit i'm not sure about, wanted to brain bounce it with you
[15:18:19] ottomata: sure!
[15:18:21] bc?
[15:18:30] k
[15:19:30] Fixed with `git checkout debian --`
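The paste itself isn't reproduced in this log, but the likely source of the confusion is that in a packaging repository `debian` names both a branch and a directory, so a bare `git checkout debian` is ambiguous; the trailing `--` is what resolves it (an inference from context, not from the paste):

    git checkout debian --      # '--' ends the revision arguments, so 'debian' is taken as the branch
    git checkout -- debian      # whereas this form would restore the debian/ directory from the index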
[15:29:25] oh mforns, do you know...what is a good way to generate frozen-requirements.txt? as far as I can tell pip freeze just puts all the requirements in your current python env in there, not necessarily just the ones needed by your project requires
[15:31:03] hm, don't know about that
[15:37:10] ottomata: pip freeze has an --exclude flag, maybe we can use that?
[15:41:33] ottomata: this is a bit hacky, but maybe we can extract some idea: https://stackoverflow.com/questions/23640182/ignore-certain-packages-and-their-dependencies-with-pip-freeze
[16:10:05] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (BTullis) I notice that Atlas will need to contact a zookeeper cluster when it runs. Whilst it might be possible to include a version of zookeeper and run it on an-test-coord10...
[16:12:15] mforns: looking
[16:13:05] mforns: i think the answer will be: Use virtualenv and do not install unwanted packages into it
[16:13:10] (or conda env)
[16:13:20] i want an automated process to generate frozen-requirements
[16:13:24] aha
[16:13:37] but has to exclude spark and others right?
[16:13:40] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (razzi) @BTullis there is a test-zookeeper1002.eqiad.wmnet node, but it's not accessible from the analytics vlan. I think we should be able to punch a hole in it using somethin...
[16:13:49] so putting things into a hardcoded ignore just because you've pip installed them manually at some point isn't quite right
[16:13:56] mforns: that will be done with extras_requires
[16:14:06] so, just don't put pyspark in your regular install_requires
[16:14:14] it goes in the provided extras_requires
[16:14:18] aha
[16:14:40] that all works; but i want a list of frozen-requirements (with all the transitive dependencies and versions) for building the conda env
[16:15:19] can we pass the extra_required packages to pip freeze --exclude flag?
[16:15:30] mforns: kinda like https://docs.npmjs.com/cli/v8/commands/npm-install
[16:15:32] The --package-lock-only argument will only update the package-lock.json, instead of checking node_modules and downloading dependencies.
[16:15:45] mforns: thats not the problem
[16:15:56] the problem is if you have a development conda or virtualenv
[16:16:13] and at some point you do something like pip install somerandompackage
[16:16:18] then later
[16:16:19] when you run
[16:16:20] yea
[16:16:20] pip freeze
[16:16:31] somerandompackage will go in frozen-requirements.txt
[16:16:54] can the packaging process run pip install from scratch, and then pip freeze on top of that?
[16:17:03] yeah that would work
[16:17:14] or sortof, something like that
[16:17:24] but we want frozen-requirements as an input to the packaging process
[16:17:37] so its sort of a standalone step
[16:17:40] hm
[16:17:53] so yeah we could make the process to make frozen-requirements always use a clean build env
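A minimal sketch of that clean-build-env approach, for the record; the name of the extra holding pyspark is whatever workflow_utils actually defines, not something pinned down here:

    python3 -m venv /tmp/freeze-env && . /tmp/freeze-env/bin/activate
    pip install .                             # install_requires only; the pyspark/"provided" extra stays out
    pip freeze > frozen-requirements.txt      # note: the project itself also appears here and may need filtering out
    deactivate && rm -rf /tmp/freeze-env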
[16:18:19] its just a bit annoying, i'd hope that pip could generate its resolved dependencies without actually installing them
[16:18:54] I see
[16:19:07] v
[16:19:07] https://discuss.python.org/t/anyway-to-resolve-package-dependencies-without-installation-or-download/5991
[16:19:28] https://github.com/pypa/pip/issues/53
[16:19:46] https://github.com/pypa/pip/issues/7819
[16:21:01] oo, reading: https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/
[16:21:51] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis)
[16:22:00] meh not that helpful
[16:22:15] interesting but yea
[16:22:15] maybe
[16:22:16] https://pypi.org/project/johnnydep/
[16:23:10] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) p:Triage→High
[16:24:53] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (Ottomata) We should punch a hole to test-zookeeper. However! > one other option is to use the zookeeper instance that is running on an-test-druid1001. @razzi this would hel...
[16:25:13] heh, cool package name!
[16:27:12] ottomata: looks good!
[16:27:27] mforns: yeah i'll try it!
[16:27:46] would you execute it for each one of the dependencies?
[16:29:20] or do you think it'd work for workflow_utils itself via setup.cfg?
[16:30:12] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (BTullis) Oh I see, I think. So queries //other than// the icinga query should still have worked? (Although there weren't any que...
[16:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[16:56:00] mforns: i'd hope it would work for setup.cfg
[16:56:09] haven't tried yet
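For anyone picking this thread up later: johnnydep resolves and prints a dependency tree without installing anything, which is the property hoped for above. Whether it can be pointed straight at a local setup.cfg (rather than a released distribution) is exactly the open question here, and the output-format flag below is from memory, so double-check it against the project's README:

    pip install johnnydep
    johnnydep requests                            # example: resolved dependency tree for a PyPI package
    johnnydep requests --output-format pinned     # if supported, emits name==version lines suitable for a frozen file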
[17:00:05] Data-Engineering, Data-Engineering-Kanban, Product-Analytics (Kanban): Test log file and error notification - https://phabricator.wikimedia.org/T295733 (BTullis) a:Mayakp.wiki→BTullis Investigating this, as the log files aren't being created as expected.
[17:01:42] ottomata: ping standuppp
[17:01:51] ty
[17:06:29] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (elukey) IIRC icinga/nagios should poke the local aqs daemon on every host, and aqs is the Cassandra client that needs to authenti...
[17:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[17:31:30] milimetric: Actually I'm facing an issue with my 'easy' stuff - you can merge/deploy the SparkSQLNoCLIDriver at will :)
[17:31:40] ok joal, will do after lunch
[17:50:03] (PS2) Joal: Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834)
[17:55:23] (CR) jerkins-bot: [V: -1] Update structured_data dumps parsing job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747508 (https://phabricator.wikimedia.org/T258834) (owner: Joal)
[18:00:38] milimetric: you were right! Full root and docker on public cloud. ottomata and I are going to give it a go this afternoon, let me know if you'd like to tag along and learn sre secrets
[18:01:27] razzi: cool, great. I can do egeria if you two are doing some of the others
[18:05:24] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) >>! In T263277#7572522, @Ottomata wrote: > The custom logic could even just be varied on the hardcoded stream / tablen...
[18:19:19] (PS1) Joal: Update refine netflow_augment transform function [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277)
[18:26:30] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster
[18:31:42] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) This is the error I am getting, I verified there are disks in the server. I also checked BIOS and it's set to auto but I do see the disks. I am n...
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:51:49] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with...
[18:53:31] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) @papaul or @robh could you look at this and let me know what I am missing.
[18:58:53] (CR) Ottomata: [C: +1] Update refine netflow_augment transform function (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747561 (https://phabricator.wikimedia.org/T263277) (owner: Joal)
[18:59:20] razzi: wanna look at docker stuff sooner rather than later? i have to leave a little early today for an appointment
[18:59:58] sg ottomata I’ll hop in the bc!
[19:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:00:28] razzi: lets do here in irc for a bit and hop in if we need to? that way i can focus on multiple things at once! :)
[19:00:33] but lets see!
[19:00:43] we want to get a node with docker so you can just run docker commands, right?
[19:00:49] Haha ok yeah
[19:01:33] So I figure I can start by creating a new data-catalog node in the data-engineering horizon ui?
[19:01:55] ya should be good
[19:02:06] i'm looking at puppet to see if that stuff makes things easier or harder
[19:02:34] Is it the main puppet? imo this kind of thing could be a separate repo
[19:02:45] like a mini puppet for cloud services to run docker
[19:03:12] cloud uses the main puppet too
[19:04:32] yeah I guess that's the current approach, it's a bit off topic but let it be known I think the main puppet having everything across prod, test, cloud etc is a bit unwieldy
[19:04:56] on the other hand; you don't have to rewrite everything
[19:05:06] my biggest complaint is that people don't code modules environment agnostic
[19:05:24] if that were done well, it wouldn't matter, you'd be able to include the profiles and set the hiera all via the UI
[19:05:28] buuuut yes, side topic
[19:05:39] yeah totally, I think a best of both worlds would be the ability to import puppet modules from a single source, then have production / cloud / any environment import what it needs
[19:05:41] razzi: i think profile::docker::engine
[19:05:42] will help you
[19:05:52] (unless, it gets in the way : )
[19:05:54] and the single source should be environment agnostic
[19:06:01] yeah
[19:06:05] so is there a new horizon project, or should I use analytics?
[19:06:15] there is data engineering
[19:06:18] analytics is fine too
[19:06:19] ok I found data-engineering in the dropdown :)
[19:06:21] yeah
[19:07:25] razzi: here is an example of doing it with puppet in deployment-prep
[19:07:26] https://horizon.wikimedia.org/project/instances/94eef137-56f8-42aa-93a8-b1d8bbfef4bf/
[19:07:50] because you won't be declaring the running service containers via puppet (i assume you'll just run docker images yourself manually)
[19:07:51] for size, I'm thinking g3.cores2.ram4.disk20: 4gb ram
[19:08:14] sure, although sometimes 4gb is not enough for big java stuff like this? but yeah
[19:08:15] but 20gb disk might not be enough; the single atlas server jar is about 900m
[19:08:43] for everything more than 4gb I see: This flavor requires more RAM than your quota allows. Please select a smaller flavor or decrease the instance count.
[19:08:58] and instance count is only 1
[19:09:57] is there a way to get bigger ram? Or am I stuck with 4gb
[19:10:08] majavah: ^
[19:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[19:10:57] oof that sucks razzi, maybe try in analytics project then
[19:11:04] haha ok let's see
[19:11:07] there is a way but we have to ask cloud
[19:11:45] razzi: you can request more quota if you need it
[19:11:55] analytics gives me 8gb, so that's better
[19:12:00] how do I do this majavah ?
[19:12:19] https://phabricator.wikimedia.org/project/view/2880/
[19:14:28] hmmm, razzi looking more, i can't tell how much profile::docker::engine will help you
[19:14:29] cool, I'll start with 8gb, hopefully that'll be enough. If it goes oom I'll look at that process majavah, thanks!
[19:14:35] it might be worth trying https://docs.docker.com/engine/install/debian/ first
[19:14:47] profile::docker::engine doesn't do much but install a vetted docker debian package
[19:15:03] it does set a couple of grub::bootparams (?)
[19:15:16] which may or may not be necessary
[19:15:20] huh
[19:15:58] iirc the data-engineering project was created with the expectation that it will eventually replace the analytics one :/
[19:16:43] hmm, service::docker could help you, but it is hardcoded to only use the wikimedia docker registry
[19:16:45] so you can't use it
[19:16:47] so
[19:17:00] but, checkout the systemd unit it declares to run the docker container
[19:17:05] service/docker-service-shim.erb
[19:17:25] it has the docker run command i think you'll need to get the ports exposed
[19:19:30] ok cool, I see the script, looks pretty simple
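If profile::docker::engine ends up being more trouble than it's worth on a throwaway Cloud VPS instance, the plain-Debian route being discussed is short; this is a sketch of the fallback, not a statement of what was actually done (the upstream docker-ce packages from docs.docker.com are the other option):

    sudo apt-get update && sudo apt-get install -y docker.io    # Debian's own docker package on bullseye
    sudo systemctl enable --now docker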
[19:20:40] majavah: yeah we should probably clean up what currently exists in the analytics project, I'm not currently using anything that's there. Take heart that the data catalog evaluation (what I just created) is not expected to last more than a few months
[19:21:20] Maybe we can up the limits for data-engineering, and then make the analytics limit 0 meaning "don't create anything new here" :)
[19:24:59] Yeah, when I asked to create the data-engineering project there was a bit of push-back about having two umbrella projects. So I specifically said let's start with low quotas and we can increase them over time as stuff from analytics might get decommissioned.
[19:37:37] Really simple cloud vps question for ottomata or anybody else: when I do `sudo apt update && sudo apt upgrade` I get: `E: The repository 'http://security.debian.org bullseye/updates Release' does not have a Release file. N: Updating from such a repository can't be done securely, and is therefore disabled by default.`
[19:38:45] hmm, maybe no need to run upgrade?
[19:40:41] yeah I guess that's not necessary
[19:40:57] Was just following an "install docker on debian" tutorial and they started with that
[19:44:16] ok cool https://wikitech.wikimedia.org/wiki/Docker is nice and to-the-point
[19:45:41] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster
[19:49:39] ok cool docker is running! ottomata you said I could access the cloud vps host on a public domain?
[19:50:35] I did `sudo docker run -it nginx -p 80:80` and I can access it locally
[19:50:54] so I can probably do ssh forwarding at this point, but sharable urls are nice :)
[19:51:05] yes...
[19:51:28] https://horizon.wikimedia.org/project/proxy/
[19:51:47] pretty straightforward
[19:51:59] create a proxy to your instance
[19:52:30] Ok cool that worked! data-catalog-evaluation.wmcloud.org
[19:52:34] ty ottomata
[19:52:44] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with err...
[19:55:44] nice!
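One small gotcha in the `docker run` line quoted above: options that appear after the image name are passed to the container's command rather than to docker, so `-p 80:80` placed after `nginx` wouldn't actually publish the port (presumably the real invocation differed slightly). The ordering that does what was intended:

    sudo docker run -d --name nginx-test -p 80:80 nginx    # options before the image name; -d to leave it running
    sudo docker ps                                          # should show 0.0.0.0:80->80/tcp published

The Horizon web proxy then only needs to point at the instance and that port.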
[20:06:11] The docker image for atlas unfortunately fails due to the same reason mvn install wasn't working on the test cluster; expired certificate for https://maven.restlet.com
[20:06:44] (it expired last Saturday)
[20:11:55] oh rats
[20:15:33] I'll create a ticket to track that issue since it's apparently not going away by itself :)
[20:22:50] yeah
[20:22:56] mforns: yt?
[20:23:07] ottomata: yess
[20:23:09] sup
[20:23:37] bb bc real quick?
[20:24:07] omw
[20:38:40] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:42:55] Hm let me see about this alert
[20:44:14] 1-12-15T20:30:31.023 INFO ProduceCanaryEvents Succeeded producing canary events to 94 / 95 streams.
[20:44:14] 1-12-15T20:30:31.023 ERROR ProduceCanaryEvents Encountered failures when producing canary events to 1 / 95 streams.
[20:45:56] It says in the source https://gerrit.wikimedia.org/g/analytics/refinery/source/+/df129ab5813669e7ea915d135726ebdc2caa7da4/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ProduceCanaryEvents.scala `// Failures have already been logged.` but I'm not sure where; ottomata do you know?
[20:47:44] razzi: i don't know why it does that sometimes, but it seems to be intermittent. i was wondering if there might be some inconsistency in one of the running eventgate's cached stream configs or schemas
[20:49:50] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:50:41] huh ok there's the recovery
[20:59:40] (CR) Milimetric: [C: +2] "merging to deploy as part of 0.1.23" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[21:07:58] (Merged) jenkins-bot: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[21:10:47] Data-Engineering, Data-Engineering-Kanban, User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (razzi)
[21:12:59] (PS1) Milimetric: Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615
[21:13:15] (CR) Milimetric: [C: +2] Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615 (owner: Milimetric)
[21:13:20] (CR) Milimetric: [V: +2 C: +2] Update changelog.md for v0.1.23 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/747615 (owner: Milimetric)
[21:13:59] Starting build #100 for job analytics-refinery-maven-release-docker
[21:32:18] Project analytics-refinery-maven-release-docker build #100: SUCCESS in 18 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/100/
[22:28:14] Starting build #59 for job analytics-refinery-update-jars-docker
[22:28:46] (PS1) Maven-release-user: Add refinery-source jars for v0.1.23 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/747620
[22:28:47] Project analytics-refinery-update-jars-docker build #59: SUCCESS in 32 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/59/
[22:35:42] (CR) Milimetric: [V: +2 C: +2] Add refinery-source jars for v0.1.23 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/747620 (owner: Maven-release-user)
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:48:17] Analytics, Readers-Web-Backlog (Needs Prioritization (Tech)), Wikimedia-production-error: eventgate_validation_error: '.web_session_id' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T297521 (Jdlrobson) Analytics team.. any idea what could be going on here?