[00:04:00] (CR) Nray: [C: +2] Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739016 (https://phabricator.wikimedia.org/T294777) (owner: Clare Ming)
[00:04:50] (Merged) jenkins-bot: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739016 (https://phabricator.wikimedia.org/T294777) (owner: Clare Ming)
[01:13:47] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (awight) >>! In T291120#7503969, @Ottomata wrote: >> which becomes problematic if a revision is...
[03:40:18] (DruidSegmentsUnavailable) firing: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:40:18] (DruidSegmentsUnavailable) firing: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:50:18] (DruidSegmentsUnavailable) resolved: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:50:18] (DruidSegmentsUnavailable) resolved: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[08:12:01] Analytics: Check home/HDFS leftovers of jmixter - https://phabricator.wikimedia.org/T295748 (MoritzMuehlenhoff)
[08:15:15] joal: Hi. I am trying to test the jar file with input path: `file:///mnt/data/xmldatadumps/public/commonswiki/entities/20211108/commons-20211108-mediainfo.json.bz2` but it says file not found. Any idea?
[08:16:02] tanny411: using file:// in spark is subject to the file being accessible to all workers at the same path, which is not the case :)
[08:16:15] tanny411: you should upload the file to HDFS first
[08:16:37] humm... okay.
[09:03:29] joal: so i copied it using `-put` and i can see it in hdfs, it still says cannot find `hdfs://analytics-hadoop/user/akhatun/commmonsjsondump.bz2`
[09:04:31] `Input path does not exist:` do I have to have it in a folder or something?
[09:05:50] that's weird tanny411 - the file is indeed there :(
[09:06:19] tanny411: how do you launch the job?
[09:06:50] `spark2-submit --master yarn --driver-memory 16G --executor-memory 32G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=32 --conf spark.executor.memoryOverhead=8196 --class org.wikimedia.analytics.refinery.job.wikidata.jsonparse.StructuredDataJsonDumpConverter refinery-job-0.1.21-SNAPSHOT-shaded.jar -i commmonsjsondump.bz2 -o commonsdumps -p commons`
[09:07:40] joal: input is `-i commmonsjsondump.bz2`
[09:08:10] hm
[09:08:34] tanny411: I'd suggest using absolute path for input and output
[09:08:58] joal: starting with hdfs://... ?
[09:09:24] if you prefer but you don't need the hdfs:// bit (can be : /user/akhatun/...)
[09:09:33] okay, checking
[09:10:20] joal: nope, same issue
[09:11:25] meh :S
[09:11:55] ohhhh.....there's a sneaky typo :|
[09:12:10] joal: so sorry, i put mmm instead of mm
[09:12:12] Ahhh! good catch :)
[09:12:32] I was starting to doubt my code :v
[09:12:59] I was starting to wonder how to help tanny411 :)
[09:13:06] hahaha
[09:56:00] Morning all. I'm shortly going to run the sre.hadoop.roll-restart-masters cookbook for the analytics cluster, as part of https://phabricator.wikimedia.org/T295673 - Hopefully there won't be any interruption to the Hadoop service.
[10:20:01] !log roll-restarting hadoop masters
[10:20:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:40:47] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 2 others: Add agent_type and access_method to event data - https://phabricator.wikimedia.org/T294246 (ovasileva) Thanks @Ottomata for the update, and @cjming for picking this up. I'll edit the task description here t...
[10:43:14] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 2 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (ovasileva)
[10:43:55] (PS1) AKhatun: Save commons json dumps as a table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834)
[10:45:08] ^^ joal: patch sent. We will need some refactoring of folder structure. I thought you can look at the code to give me some hint on that.
[10:46:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org
[10:46:50] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org
[10:47:32] Looking at this now. ---^
[10:48:14] The alarm might be firing due to the restarting of the namenode services as part of the hadoop cookbook.
[10:50:26] From wikitech:
[10:50:26] > If there are roll restart of Hadoop HDFS Datanodes/Namenodes in progress, or if one was performed recently. In the past this was a source of false positives due to the JMX metric reporting a temporary weird values. In this case always trust what the fsck command above tells you, it is way more reliable than the JMX metric (from past experiences).
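A minimal sketch of the fsck cross-check that the wikitech note above recommends, runnable from a Hadoop client host with a suitable Kerberos ticket; the exact wording of the fsck summary line is an assumption, so the regex may need adjusting:

```python
# Hypothetical helper: cross-check the JMX corrupt-blocks figure against `hdfs fsck`,
# as the alert runbook quoted above suggests.
import re
import subprocess

def corrupt_block_count(path="/"):
    """Run `hdfs fsck` on `path` and return the corrupt-block count from its summary."""
    # fsck on "/" can take a while on a large cluster, so a narrower path may be preferable.
    result = subprocess.run(
        ["hdfs", "fsck", path],
        capture_output=True, text=True, check=False,
    )
    match = re.search(r"Corrupt blocks:\s+(\d+)", result.stdout)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    print("Corrupt blocks according to fsck:", corrupt_block_count())
```

This is only meant to confirm (or rule out) the figure reported via JMX; the alert itself stays the source of the page.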
[10:51:48] I'm going to allow the cookbook to finish before running an `fsck`. Currently JMX is reporting 94 corrupt blocks via an-master1002 and 0 via an-master1001.
[10:53:20] yes it happened in the past, I can confirm
[10:54:31] btullis: in general I think it is best to wait for gc metrics to recover before proceeding with the next node
[10:55:22] (and other metrics) but some noise may happen
[10:55:26] (like the aboce)
[10:55:30] above
[10:56:36] elukey: Thanks. You mean wait for GC counts to drop back to normal, like this one? https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=3&orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-1h&to=now
[10:56:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org
[10:56:50] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org
[10:57:54] btullis: yes yes exactly
[10:58:47] just as precaution, the more our hdfs file size grows the more those jvms will take to load :(
[10:59:12] Yes, I see. OK. Will try to do that in future. I wasn't really aware before that it was an issue.
[11:01:17] it is mostly a precaution, not a real issue, so most of the times works even if not waiting
[11:01:24] I was also wondering if there was a more graceful way to roll-restart the workers, given that my cookbook yesterday caused an Oozie job to fail.
[11:03:30] the current cookbook has some limits, for example it doesn't restart more than two Datanodes at the time since it is not rack aware
[11:04:10] Also, I find it quite hard to keep concentrating on the restart-master cookbook and its timings, given that there are at least two 10 minute waits involved in running it. :-)
[11:04:40] I didn't get it
[11:05:01] there are too many wait timings?
[11:05:15] (we can always change it, it is not set in stone :)
[11:05:30] anyway, for the failed oozie job, I assume it failed since the connection to a datanode broke
[11:05:47] (the nodemanager can be restarted without impacting ongoing jobs)
[11:06:16] so a more graceful way could be to have some sort of decommission of hdfs datanodes, but I am afraid that it will try to replicate blocks elsewhere
[11:06:36] Yes, sorry I was talking about two different things at the same time.
[11:09:38] Restarting the *masters* has two periods of `Sleeping 600.0 seconds.` - It's not set in stone and it's already a parameter so I can vary it, it's just that now I have to watch this window *and* check the GC timings, corrupt blocks etc, before I can type `go` again. I'm just thinking out loud about whether we can make it more hands-off and still robust.
[11:11:45] ah yes sure! The 10 mins were an attempt to force people to wait for the GC timings basically, that mostly are due to the namenode reading the last fsimage from disk (that takes time, and during that window the namenode is not ready to serve requests)
[11:12:05] another option could be to make the cookbook prometheus-aware and watch for metrics
[11:12:51] the gc timings are a little weird since they may have spikes even after the main "recovery" post-restart
[11:13:24] but if we find a good metric that highlights the fact that the namenode is ready, then it would be totally a great improvement
[11:13:44] Ah, right, thanks for the clarification about the restart of the nodemanager vs datanode processes. Yes, that's probably where the oozie job failed.
[11:14:28] I read about the datanode decommissioning process, but I agree that it's too heavyweight for a rolling restart process.
[11:14:54] for the namenode, this is the issue
[11:14:56] 2021-11-16 10:40:53,555 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 55399940 INodes.
[11:14:59] 2021-11-16 10:42:25,852 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Successfully loaded 55399940 inodes
[11:15:02] 2021-11-16 10:42:53,744 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3322ms
[11:15:04] I agree that making the cookbook prometheus-aware would be the gold standard approach and finding the right metrics to monitor would be great.
[11:15:05] GC pool 'G1 Young Generation' had collection(s): count=1 time=3785ms
[11:15:08] 2021-11-16 10:43:29,729 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1279ms
[11:15:11] GC pool 'G1 Young Generation' had collection(s): count=1 time=1644ms
[11:15:14] 2021-11-16 10:43:30,746 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 157 seconds.
[11:15:19] and that number keeps increasing :(
[11:16:11] after loading the FSImage the namenode gets the remaining transactions from the journal nodes (basically last-txid-in-fsimage -> now)
[11:16:21] but it is very fast
[11:17:52] You mean that the 157 seconds FSImage load value keeps increasing every time we fail over gracefully, the more stuff we save to HDFS?
[11:21:06] the latter
[11:21:39] ah sorry yes yes (I've read it at first as two separate things)
[11:22:15] the fsimage file grows over time, and the namenode keeps the hdfs inodes all on heap
[11:23:06] so it has to bootstrap its view of the hdfs state
[11:23:51] I've read on the internet that big companies with large hdfs deployments have procedures that last a long time to restart the hadoop masters
[11:24:05] +1 - Right, makes sense, thanks. Not an easy one to fix.
[11:30:35] I'm about to run the cookbook to restart the druid-public cluster, unless anyone has any objections...
[11:31:02] +1 for me
[11:32:30] !log btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers public
[11:32:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:35:57] (lunch)
[11:41:28] ah btullis not sure if I forgot to mention https://github.com/wikimedia/operations-software-druid_exporter#known-limitations
[11:41:48] I used to roll restart the prometheus druid exporters after a cluster roll restart
[11:42:03] I don't recall if we added this step to the cookbook
[11:42:12] but metrics may appear weird if not
[11:42:27] (need to go now but we can chat later in case)
[11:42:38] OK, thanks. Good to know.
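As a rough illustration of the "prometheus-aware cookbook" idea discussed above (polling until the restarted namenode looks ready instead of sleeping a fixed 600 seconds), something along these lines could work; the Prometheus endpoint and the readiness metric name here are placeholders, not the real production values:

```python
# Sketch: poll Prometheus until a hypothetical NameNode-readiness metric reports 1,
# rather than sleeping a fixed 600s between master restarts in the cookbook.
import time
import requests

PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"  # placeholder URL

def query_prometheus(promql):
    """Return the first scalar value of a PromQL instant query, or None if empty."""
    resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None

def wait_for_namenode_ready(instance, timeout=1800, poll=30):
    """Wait until the (made-up) readiness metric reports 1 for `instance`."""
    promql = f'hadoop_namenode_fs_image_loaded{{instance="{instance}"}}'  # hypothetical metric
    deadline = time.time() + timeout
    while time.time() < deadline:
        if query_prometheus(promql) == 1.0:
            return True
        time.sleep(poll)
    return False
```

Whether a single metric like this exists for the namenode is exactly the open question raised above; the sketch only shows where such a check would slot into the cookbook.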
[12:05:17] Analytics: Check home/HDFS leftovers of jmixter - https://phabricator.wikimedia.org/T295748 (JAllemandou) Open→Resolved a:JAllemandou Result of our data checking script: ` ====== stat1004 ====== total 0 ====== stat1005 ====== total 0 ====== stat1006 ====== total 0 ====== stat1007 ====== total...
[12:31:03] Analytics, Infrastructure-Foundations, SRE-tools, netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (Volans)
[12:31:17] Analytics, Infrastructure-Foundations, SRE-tools, netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (Volans) p:Triage→Medium
[13:22:23] joal: Do you have any insight into why the aqs_endpoint_health check is currently showing a warning on the aqs_next cluster? https://alerts.wikimedia.org/?q=alertname%3DIcinga%2Faqs%20endpoints%20health
[13:39:17] I don't quite follow the links yet as to how these checks work. I know that we put in some fake data for 1970/01/01 https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Add_fake_data_to_Cassandra_after_wiping_the_cluster
[13:39:54] I know that the check itself tries to do a check against the swagger spec.
[13:39:59] https://www.irccloud.com/pastebin/TjwRvJlK/
[13:45:56] I can see that the spec for that endpoint is defined here: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/aqs/+/refs/heads/master/v1/edits.yaml#116
[13:45:56] ...but then this is where I'm failing to connect any more dots.
[13:54:25] Is it because */edits* isn't a tablespace that we have configured in cassandra?
[13:56:26] (CR) Ottomata: [wip] Start of new presto query logger schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/738987 (https://phabricator.wikimedia.org/T269832) (owner: Razzi)
[13:58:12] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (jcrespo) This is probably related to this maintenance, but backups on the analytics meta database f...
[14:11:59] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ottomata) > I only call this problematic because it's important that our downstream consumers...
[14:13:01] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 2 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (Ottomata) a:Ottomata→cjming
[14:14:45] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (Ottomata) Is there a special manual grant added for the dump user? @btullis restored the analytics...
[14:16:37] (CR) Ottomata: Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[14:18:03] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (jcrespo) >>! In T284150#7506680, @Ottomata wrote: > Is there a special manual grant added for the d...
[14:18:56] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (Ottomata) Thank you!
[14:19:06] Analytics, Infrastructure-Foundations, SRE, SRE-tools, netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (elukey) The only recent thing that I recall is T276239, but not for all workers mentioned. I checked quickly the dry-run for...
[14:31:15] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (jcrespo) New backup is running now, will ask you for a review when finished, to make sure expectati...
[14:33:33] hey btullis - that's weird
[14:36:27] btullis: from the message it is as if the check was run without parameters?
[14:44:53] mforns: just added comments to your MR
[14:48:34] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (jcrespo) For your review: Backed up dbs: ` /srv/backups/dumps/latest/dump.analytics_meta.2021-11-1...
[15:09:02] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (Ottomata) Awesome, thank you Jaime!
[15:12:55] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (BTullis) Ah, fabulous, thanks @jcrespo. I had found some grants for the dump user defined here: ht...
[15:14:23] joal: Yes I thought it was weird too. The command that seems to be executed is this (on aqs1010 anyway):
[15:14:23] `/usr/bin/service-checker-swagger -t 5 10.64.0.40 http://10.64.0.40:7232/analytics.wikimedia.org/v1`
[15:21:53] Analytics, Infrastructure-Foundations, SRE, SRE-tools, netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) Yes I thought this was a bit odd. I saw there was a bit of re-imaging here: T231067#6891049 but that was before my t...
[15:22:10] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) a:BTullis
[15:22:23] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (jcrespo) Grant management and checking is a pending task we have to solve, but it is not easy for a...
[16:07:08] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) ==an-worker1104== ====Current interfaces snapshot: {F34750374,width=600} ====Current in...
[16:20:22] grrr... connection's just been bad...
[16:38:22] mforns: btw, i'm playing with a new wmf_airflow_lib python module, seeing if i can make a basic code structure with our ideas
[16:48:07] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) I think that it might be useful to look first at getting the messages into rsyslog, then into logstash, before we go straight to...
[17:16:09] Analytics, Product-Analytics, Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia before wmf8 is on English Wikipedia - https://phabricator.wikimedia.org/T295432 (jeena) Hello, I am train conductor this week. Is this resolved now?
[17:21:20] hey a-team! I'm getting "Connection refused" errors trying to run a spark query in a Jupyter notebook on stat1006. Things seem to work just fine from the command line. Has something changed that requires Jupyter to update?
[17:21:46] Analytics, Product-Analytics, Readers-Web-Backlog (Kanbanana-FY-2021-22): Lower sampling rate for MobileWebUIClickTracking on English Wikipedia before wmf8 is on English Wikipedia - https://phabricator.wikimedia.org/T295432 (nray) Open→Resolved Hi @Jeena, Yes the changes have been deployed a...
[17:21:51] Nettrom: not that I know! I assume you have kinit-ed
[17:22:10] yes, because otherwise the CLI query would fail with a permission denied error
[17:22:25] it's only when running it from Jupyter I get this connection refused error
[17:22:27] Hi I am not able to run a spark-session via jupyter on stat1008. I usually start via wmfdata.spark.get_session() but now get an error
[17:22:31] hm, Nettrom there was one thing, btullis just moved something... /srv/ something... hold on
[17:22:43] https://phabricator.wikimedia.org/T295346
[17:22:50] Ah! good catch milimetric
[17:23:06] it might be related, given two separate reports, btullis ^
[17:23:17] oh maybe this is related to what nettrom describes
[17:23:18] I've seen the same problem that mgerlach reports, btw
[17:23:43] Uh-oh. Looking asap.
[17:23:48] spark.run() and spark.get_session() result in the same "Connection refused" error for me
[17:25:04] https://www.irccloud.com/pastebin/oA1GoXAP/
[17:26:06] ^this is what I see in the notebook when trying to start a spark-session
[17:26:35] I am going to revert the change and test it again more thoroughly.
[17:28:09] btullis: weird, I have spark CLI in scala working :S
[17:28:39] I also ran `spark2-shell` and it loaded, but with several warnings.
[17:29:31] thanks for checking btullis
[17:30:05] I could load up the same conda environment on the command line, import wmfdata, then use wmfdata.spark.run() to execute a query just fine. It's when running from Jupyter things fail
[17:33:40] Oh. I see. I've reverted now. Will look at Jupyter issue tomorrow.
[17:34:51] btullis: jupyter works now
[17:35:04] mgerlach: Thanks for the feedback.
[17:35:05] btullis: thanks so much! works for me again now
[17:35:38] btullis: thanks for the quick fix : )
[17:52:01] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (bd808) >>! In T291120#7506676, @Ottomata wrote: > I think we agree @awight, and I am not attac...
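For reference, the wmfdata calls from the Jupyter/Spark exchange above boil down to roughly the following; the query is arbitrary and the exact keyword arguments accepted by get_session()/run() may vary between wmfdata versions:

```python
# Minimal reproduction of the calls discussed above, runnable from a notebook or from
# the CLI in the same conda environment (after `kinit`).
import wmfdata

# This is the call that was failing from Jupyter with "Connection refused".
spark = wmfdata.spark.get_session()

# Trivial query, just to confirm the session can actually reach YARN/Hive.
spark.sql("SELECT 1 AS ok").show()

# wmfdata.spark.run() wraps the same session setup and returns the query result.
print(wmfdata.spark.run("SELECT 1 AS ok"))
```

Running the same snippet from the CLI and from Jupyter was what narrowed the problem down to the notebook environment rather than Spark or Kerberos.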
[17:57:37] a-team: deploying refinery-source now, only one change (preventing null-ids row to be deduplicated in refine) - anything else for anyone?
[17:59:45] btullis: do you have minute to talk about alerts?
[18:00:26] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) So at first glance, this looks like the Netbox script will do the right thing. It will...
[18:00:27] joal: Yes
[18:00:35] cool btullis - batcave?
[18:03:29] joal: there was one other task from ottomata in ready to deploy, and I didn't see it in the etherpad
[18:03:48] and joal: can I merge the sqoop change before you deploy refinery?
[18:04:08] I could test it but I'm pretty sure the query's fine
[18:04:12] no problem for me milimetric - would you mind updating the etherpad?
[18:04:17] will do
[18:07:03] (CR) Milimetric: [V: +2 C: +2] Add discussiontools_subscription query to sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: MNeisler)
[18:07:44] (CR) Milimetric: [V: +2 C: +2] "Resolving our conversation for deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/736021 (https://phabricator.wikimedia.org/T290516) (owner: MNeisler)
[18:08:00] ok, merged and added to etherpad, thanks joal!
[18:08:06] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (Volans) @BTullis fwiw +1 from my end, thanks for having a look.
[18:18:03] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ottomata) > What concretely is traded away in the sense of CAP theorem's framing of the binary...
[18:19:06] joal: thats the only one for me, i was going to try to get one in about access_method, but not anymore! :)
[18:19:15] makes sense ottomata :)
[18:19:17] will proceed
[18:23:50] !log Releasing refinery-source v0.1.21
[18:23:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:24:14] Starting build #98 for job analytics-refinery-maven-release-docker
[18:32:27] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 2 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (LGoto)
[18:37:57] Project analytics-refinery-maven-release-docker build #98: SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/98/
[19:12:22] Starting build #57 for job analytics-refinery-update-jars-docker
[19:12:52] (PS1) Maven-release-user: Add refinery-source jars for v0.1.21 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/739329
[19:12:52] Project analytics-refinery-update-jars-docker build #57: SUCCESS in 30 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/57/
[19:13:20] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/739329 (owner: Maven-release-user)
[19:15:19] !log Deploying refinery with scap
[19:15:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:25:57] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (Ottomata) Weird! Okay, Jaime's account entry in the superset database was incorrect, and specifically, her username was her LDAP CommonName instead of her lowerc...
[19:35:34] razzi: yt? i am struggling with what should be a very simple python module and imports
[19:35:42] maybe you can help me! :)
[19:40:23] !log Deploying refinery to HDFS
[19:40:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:47:26] Ok deployment done - Will stop for tonight :)
[19:48:06] thanks joal!
[20:05:56] mforns: yt? got a sec to help with python imports?
[20:10:51] i think i have a circular imports problem
[20:11:33] heya ottomata sure, can I have 5 mins?
[20:11:39] yup
[20:12:02] mforns: qeq-fuuz-chz
[20:12:02] meet.google.com/qeq-fuuz-chz
[20:54:28] (CR) Ebernhardson: "do we need to do anything special when merging in this repo?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735445 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[22:15:49] Analytics, LDAP-Access-Requests, SRE: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (CGlenn)
[22:20:53] (CR) Ottomata: Add performer field to sparql/query (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735445 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[22:35:57] (PS1) GoranSMilovanovic: T294983 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739359
[22:36:16] (CR) GoranSMilovanovic: [V: +2 C: +2] T294983 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739359 (owner: GoranSMilovanovic)