[00:00:59] Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 4 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (TheDJ)
[00:44:19] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: [Airflow] Add deletion job for old anomaly detection data - https://phabricator.wikimedia.org/T298972 (odimitrijevic)
[00:44:45] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: [Airflow] Add deletion job for old anomaly detection data - https://phabricator.wikimedia.org/T298972 (odimitrijevic) p: Triage→High
[01:04:00] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (odimitrijevic)
[01:05:18] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (odimitrijevic) FTR I migrated all the tickets that are still relevant even for historical reasons to Data-Engineering board. Here is a broad approach taken: * I left Wikistats on the analytics board wit...
[05:17:16] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[05:22:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:44:02] (CR) Joal: [V: +2 C: +2] "Merging for hotfix deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/753563 (https://phabricator.wikimedia.org/T263277) (owner: Joal)
[08:45:50] !log Kill-restart wikidata-json_entity-weekly-coord after deploy
[08:45:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:49:30] bonjour
[08:49:37] Good morning!
[09:36:54] Bore da.
[09:37:58] TIL :)
[09:38:02] buongiorno!
[09:38:10] Bonjour!
[09:39:00] I'm about to do the upgrade of hive and oozie-client on an-coord1002, then I'll prepare the DNS change to switch coordinators. OK with you?
[09:39:43] +1
[09:39:57] yessir
[09:41:18] the hive patch that we are using has been merged in branch-2.3 of Apache Hive upstream
[09:41:55] Cool. Once both coordinators are done and failed back, I will do a debdeploy for the hive components running on the an-workers and anything left over.
[09:42:00] not sure if they are going to release a new version, but there is also the option in the future (say before hive 3.x) to jump to hive 2.3.11 or similar
[09:42:15] elukey: Ah, good to know. Thanks.
[09:42:51] !log btullis@an-coord1002:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie-client
[09:42:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:53:12] !log DNS change deployed, failing over hive to an-coord1002.
[09:53:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:04:43] elukey: might have a small issue with kafka-test. I rebooted one broker and now no messages are flowing.
https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=test-eqiad&var-cluster=kafka_test&var-kafka_broker=All&var-disk_device=All&from=now-15m&to=now
[10:08:04] It's related to certificate errors, I think. Tailing /var/log/kafka/server.log and it's `Caused by: java.security.cert.CertificateExpiredException: NotAfter: Wed Dec 15 11:34:00 UTC 2021`
[10:12:16] here I am sorry
[10:12:34] wow the cert is already expired??
[10:15:25] checking
[10:15:41] Looks like it. Also checking.
[10:16:01] it seems not on all nodes
[10:16:39] 1006 says
[10:16:39] notBefore=Jan 10 00:33:00 2022 GMT
[10:16:39] notAfter=Feb 5 03:33:00 2022 GMT
[10:16:44] the others 2021, so expired
[10:18:08] elukey@kafka-test1006:~$ ls -l /etc/kafka/ssl/kafka_test-eqiad_broker.keystore.p12
[10:18:11] -r--r----- 1 kafka kafka 2278 Jan 10 00:38 /etc/kafka/ssl/kafka_test-eqiad_broker.keystore.p12
[10:18:15] Yes. How are you seeing the dates on 1006? I was using this, but it doesn't show the dates unless they fail validation. `openssl s_client -connect kafka-test1006.eqiad.wmnet:9093`
[10:18:38] I pipe it to | openssl x509 -dates
[10:18:56] 😎
[10:19:10] elukey@kafka-test1007:~$ ls -l /etc/kafka/ssl/kafka_test-eqiad_broker.keystore.p12
[10:19:13] -r--r----- 1 kafka kafka 2278 Jan 3 16:21 /etc/kafka/ssl/kafka_test-eqiad_broker.keystore.p12
[10:19:16] mmmm
[10:19:57] I have restarted 1007 and now I see the new cert
[10:20:22] Oh, right. So a rolling restart of the brokers?
[10:20:45] yeah I think so, these instances have just been rebooted?
[10:21:10] it is a very weird use case, I'll follow up with John
[10:21:14] good that we discovered it in test :D
[10:21:17] Just 1006. They're all going to get rebooted though, as part of T294120
[10:21:18] T294120: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120
[10:21:28] ahhhh
[10:22:08] okok so probably they had long standing tcpconns
[10:22:16] until 1006 was rebooted
[10:22:33] and at that point, since the new bundle was not loaded, they started failing
[10:22:40] does it make sense?
[10:23:04] there is probably a way to force a longer expiry date, and we'll need alarms
[10:24:35] Yes I think so. Still strange that they should all fail to pass messages when one broker drops out though, given that they all had long-running connections to each other. Maybe there is something to say that they should all refresh their inter-broker connections when topology changes.
[10:25:47] We should have the kafka service subscribe to the cfssl keystore, shouldn't we? So that we get automatic restarts. We discussed this in relation to presto as well, when we started using cfssl for that (in test).
[10:26:46] I think we shouldn't auto-restart, it can lead to a dangerous place, for example if a lot of brokers all restart at once
[10:26:49] https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=51&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=test-eqiad&var-cluster=kafka_test&var-kafka_broker=All&var-disk_device=All
[10:26:53] Or vice-versa, have the keystore notify the service.
[10:26:55] 1006 was acting as controller
[10:27:16] a new one was elected, and all the others had to connect
[10:27:43] what I think we should do is add TLS cert expiry checks, and extend their validity to 1/2 yrs
[10:27:59] once we get the alert then we can roll restart
[10:28:01] Oh yes, I see. I'd mis-read that graph. I thought that 1007 was the controller before the incident.
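For reference, the two commands quoted above combine into the check below. This is a minimal sketch using the broker host, TLS port and keystore path already mentioned in this log; `-noout` is an addition here purely to suppress the certificate dump, and the second command assumes you can supply the keystore password when prompted.

    # Validity window of the certificate the broker is actually serving on its TLS listener
    echo | openssl s_client -connect kafka-test1006.eqiad.wmnet:9093 2>/dev/null \
      | openssl x509 -noout -dates

    # Validity window of the on-disk keystore deployed by puppet (prompts for the keystore password);
    # a mismatch between the two is the condition the proposed check/alert would catch.
    sudo openssl pkcs12 -in /etc/kafka/ssl/kafka_test-eqiad_broker.keystore.p12 -nokeys \
      | openssl x509 -noout -dates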
[10:34:18] I see your point. I'm just naturally inclined to try to avoid toil that we have to perform manually and alerts that we fire intentionally.
[10:35:13] Can we think of any other way of avoiding a thundering herd of restarting kafka brokers, yet have them pick up the new certificates automatically?
[10:38:03] I have restarted the remaining brokers in kafka-test
[10:38:09] I can't imagine a lot of alternatives for systems like kafka
[10:38:46] in this case the validity for the certs is too short, we have the same problem with the puppet-generated ones, but the expiry is set to say +5y
[10:39:30] an auto-restart for Presto is probably fine
[10:40:19] the workers and coordinator don't hold a lot of state
[10:40:45] so even in the case of multiple restarts in a short timeframe the worst that can happen is having some requests dropped
[10:41:08] for kafka it can lead to data loss and services screaming all around the infra :D
[10:42:42] OK, so the check logic would be like:
[10:42:42] * running kafka broker SSL dates don't match the dates of the on-disk certificate
[10:42:42] * Runbook says: ensure that all brokers are showing the alert, then run the cookbook to roll-restart, before the `notAfter` date.
[10:44:29] In this example, we had 25 days or so between `notBefore` and `notAfter` so at least we would have a long window of time during which we could run the cookbook. That would cover an xmas holiday for example.
[10:44:41] I would start with something simpler, namely a check that warns when the certs are about to expire (say in a week's time etc..). We already have them in various places, it is generally very good (independently from automation)
[10:47:05] OK, you mean just check the certs of the running kafka brokers?
[10:47:21] exactly yes
[10:48:13] now that I think about it, not sure if I have done it for the hadoop shufflers (they use the host puppet tls cert, that expires in ages)
[10:52:55] OK. Should we check the other kafka clusters now, while we think about it?
[10:53:31] I am not sure if we have the checks but they are not active for kafka test, or if they need to be added
[10:56:31] It looks to me like we don't have a certificate expiry check for any Kafka clusters.
[10:57:28] kafka jumbo has
[10:57:29] notBefore=Dec 4 14:47:46 2017 GMT
[10:57:29] notAfter=Dec 4 14:47:46 2022 GMT
[10:57:46] kafka main 2023
[10:58:09] same for kafka logging
[11:03:09] Kafka main is still running the puppet CA. I thought you had updated it?
[11:03:13] https://www.irccloud.com/pastebin/Gkm4Obmk/
[11:09:03] (CR) Phuedx: "A good start. Thanks for picking this task up!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[11:09:11] nono I haven't moved any non-test cluster
[11:09:33] the change requires all clients to switch ca-bundle to trust
[11:09:45] the list is huge, I am currently doing it for jumbo
[11:10:47] Ah, right. Sorry. I had got out of sync. I thought that they had all been moved, but it's just the ca-bundle containing both the puppet and PKI certs (but nothing else) that you've been doing. Is that right?
[11:11:13] (CR) Phuedx: WIP: Basic ipinfo instrument setup (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[11:11:34] I didn't mean to belittle it by saying "...just the ca-bundle" there. :-)
[11:11:35] correct
[11:11:41] yes yes :D
[11:12:43] Right, back on the same page now. Thanks.
OK, so we're not in any immediate danger of the other clusters having expired certificates. I can create a ticket for implementing certificate expiry checks for kafka.
[11:13:44] btullis: I think that we need to restart hive-server2 on an-coord1002, it doesn't have the right xmx settings etc..
[11:14:15] I think that the upgrade of the package started the daemon before puppet was able to modify hive-env.sh
[11:14:17] Oh. Looking now.
[11:15:16] Should I upgrade an-coord1001, run puppet, restart the daemon, then fail back with DNS? Would that be the most seamless option?
[11:16:12] Or is it better simply to restart the services directly on an-coord1002 while it is in service?
[11:17:40] Sorry, it didn't occur to me that hive-env.sh would be in the package as well and would need puppet to modify it after installation.
[11:18:30] we can quickly restart the hive server, jobs will cause heap errors for sure
[11:18:50] !log btullis@an-coord1002:~$ sudo systemctl restart hive-metastore hive-server2
[11:18:51] it may impact some ongoing jobs but we can restart it
[11:18:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:18:55] *them
[11:19:11] perfect looks good now
[11:21:05] Also restarting hive and oozie services on an-coord1001 now.
[11:23:40] !log btullis@an-coord1001:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie oozie-client
[11:23:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:24:18] Running puppet agent on an-coord1001
[11:26:14] !log restarted hive-metastore and hive-server2 on an-coord1001 after running puppet.
[11:26:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:29:52] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:31:14] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:49:12] I have been asked to reboot eventlog1003. Just checking whether this is OK. I think it is, because it's just a Kafka consumer and will carry on processing where it left off after boot, but if anyone has any guidance, please let me know.
[11:52:55] !log Upgrading hive packages on stat1005
[11:52:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:54:57] btullis: yes it is fine to reboot it, if you want to be super careful you can stop eventlogging on the node
[11:55:12] (there should be a catch-all unit that stops all daemons)
[11:55:28] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (BTullis) I have tried the test case with the upgraded hive packages, but they do not fix the issue. I will now change the logging handler for parquet, so that it logs...
[11:55:50] elukey: Thanks. Will do.
[11:59:01] !log stopped eventlogging service on eventlog1003 with 1 hour's downtime.
[11:59:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:39:50] btullis: is it ok for me to restart failed oozie jobs?
[12:40:48] no answer, positive answer :) doing it now
[12:41:12] !log rerun failed instances of webrequest-load-coord
[12:41:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:44:08] ottomata: good morning, I have a naming question for you when you are up to it :)
[13:02:17] ottomata: for the new network-flows, the current name (no data flowing yet) is network_flows_internal - I suggest we make it network_internal_flows - any preference?
[13:39:03] joal: Apologies for not answering. Was out to lunch. Many thanks for re-running them.
[13:41:14] I confess I was still trying to work out how to re-run them. I still struggle with the Hue interface and knowing how to re-submit.
[14:08:10] o/ joal
[14:08:24] i forget, isn't the software thing that creates them called netflow?
[14:08:25] so
[14:08:41] maybe keeping network and flow next to each other is better than separating?
[14:26:30] I prefer `network_flows_internal` because they're "network flows" and they're "internal"
[14:27:59] agree
[14:28:01] The thing that creates them is called 'sflow' unfortunately. The 's' means 'sampled' because it's just like netflow, but intended for higher volume traffic where sampling is the only realistic way to analyse it.
[14:28:17] snetflow
[14:28:17] :p
[14:28:37] when naming things, i find that the natural english way to do it is often pretty inconsistent
[14:28:44] where descriptors go before nouns
[14:28:47] so
[14:28:53] internal_flows vs external_flows
[14:29:10] Yeah. :-) The thing that creates the other one is called 'netflow' but I'd support renaming that one to `network_flows_external` unless it's a lot of work.
[14:29:13] it puts the emphasis, and alphanumeric sorting, on the descriptor, rather than the entity
[14:29:19] better to keep entity first
[14:29:20] hence
[14:29:25] eventgate-analytics-external
[14:29:27] instead of
[14:29:33] the external analytics eventgate
[14:30:42] We should use Welsh where the adjective goes after the noun. :-) 🏴󠁧󠁢󠁷󠁬󠁳󠁿
[14:32:29] huhu
[14:33:03] ok - I don't think there is a plan to rename the original netflow for now, but we'll go with "network_flows_internal" for that one :)
[14:33:19] thanks ottomata and btullis for the brainbounce
[14:33:27] btullis: i'm down
[14:33:31] :)
[14:33:39] joal: Always a pleasure.
[14:33:43] yeah it would be nice to rename it but probably not worth the effort
[14:33:54] might be worth a comment somewhere saying "if we could we would :)"
[14:46:25] ottomata: o/ how should we handle the eventgate analytics deployments?
[14:46:31] (never done one)
[14:46:45] not really urgent, we can deploy next week too
[14:49:40] elukey: should be easy peasy
[14:49:59] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate_service_values_config_change
[14:50:11] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Code_deployment/configuration_changes
[14:50:37] basically, after merging, steps 6 and 7 in that last link
[14:57:00] woops sorry ottomata I pinged you on two chans
[14:57:10] anyhow - leaving for now, back later o/
[14:58:17] ottomata: yeah but I'd be more comfortable if there was somebody helping review the deployment after staging for example :D
[15:00:47] elukey: i can help
[15:01:06] anytime!
[15:01:11] now is fine if you wanna
[15:01:53] joal: saw it.
[15:02:52] ottomata: ack!
[15:02:53] joal: i'll merge with you when you are around later
[15:04:02] elukey: if the deployment works...it should be fine.
the eventgate deployment chart uses a test event readiness probe. a pod won't be pooled until it can successfully produce an event
[15:04:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/templates/deployment.yaml#58
[15:06:04] last famous words :D
[15:19:12] razzi: o/ would you be interested in doing some eventgate k8s deployments? Context: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/753425
[16:01:26] elukey: yeah I'd have to learn some but I'm always down to learn!
[16:02:11] ack I'll leave it to you and Andrew then :)
[16:21:41] Data-Engineering, Data-Engineering-Kanban, Superset: Superset SQL Lab error - https://phabricator.wikimedia.org/T298699 (razzi) Open→Resolved a: razzi
[16:32:56] Data-Engineering, serviceops, Patch-For-Review: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (herron)
[16:33:11] Data-Engineering, serviceops, Patch-For-Review: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (herron)
[16:38:30] Data-Engineering, serviceops, Patch-For-Review: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (elukey) Open→Resolved a: elukey
[17:47:13] ottomata: Heya - is now a good time for merging the puppet refine patch?
[17:47:32] joal sure!
[17:47:38] cool :)
[17:48:22] The thing to monitor will be refine for netflow - the hive-to-druid new jobs won't happen before tomorrow
[17:50:48] aye
[17:50:57] ok merged joal
[17:51:12] going to launch a refine_netflow just to see
[17:51:25] awesome thank you
[17:51:52] ottomata: we don't yet have data for the new dataset - but it's important the old one still works
[17:52:14] oh
[17:52:14] k
[17:52:34] well, nothing new to refine but it ran ok
[17:53:50] ok great ottomata - I'll triple check data after the next run with something refined
[17:54:04] thanks again :)
[18:05:18] ottomata: sending another patch - I realized a line I changed was not saved in the previous one - I'm sorry for that :(
[18:08:13] Added you as a reviewer ottomata - https://gerrit.wikimedia.org/r/c/operations/puppet/+/753794/
[18:11:36] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of christinedk - https://phabricator.wikimedia.org/T297461 (razzi) Here are the remaining files and tables for christinedk: ` hack $ ./user_left.bash christinedk ====== stat1004 ====== total 0 ====== stat1005 ====== total 0 ====== st...
[18:19:10] PROBLEM - Check unit status of eventlogging_to_druid_network_internal_flows_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_internal_flows_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:49:00] I just glanced at ^ and I see `NoSuchObjectException(message:event.network_internal_flows table not found)`, anybody familiar with this table?
[18:49:49] Looks like a hive table, since the method that threw the exception is `at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:35066)`
[18:50:04] joal: ^
[18:50:06] oh another patch!
[18:50:30] joal: merged
[19:03:04] Data-Engineering, Data-Engineering-Kanban: Check home/HDFS leftovers of christinedk - https://phabricator.wikimedia.org/T297461 (diego) yes.
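The failing timer above comes down to the Hive table not existing yet. A quick, hedged way to confirm that from a client host would be something along these lines, assuming the standard hive CLI access the team already uses elsewhere in this log:

    # List the tables in the `event` database and count matches for the one the Druid load job expects.
    hive -e 'SHOW TABLES IN event;' 2>/dev/null | grep -c network_internal_flows
    # Prints 0 until Refine has created event.network_internal_flows from incoming sflow data,
    # which is why eventlogging_to_druid_network_internal_flows_hourly fails with NoSuchObjectException.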
[19:50:15] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Requirements - https://phabricator.wikimedia.org/T294258 (odimitrijevic)
[19:50:44] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (odimitrijevic)
[19:51:56] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Connect Atlas to a Data Source - https://phabricator.wikimedia.org/T298710 (odimitrijevic)
[19:55:43] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (razzi)
[19:56:08] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Connect Atlas to a Data Source - https://phabricator.wikimedia.org/T298710 (Milimetric) Atlas doesn't seem to have a first-class connector for MySQL / MariaDB, people have created scripts that use the REST API to manage this kind of import. Th...
[19:56:19] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (razzi)
[19:56:21] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (razzi)
[19:56:23] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (razzi)
[19:57:45] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (razzi)
[19:59:47] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (razzi) Currently adding a hive docker container
[20:12:54] mwarf - thanks razzi for looking into that - Since there is no data, the table hasn't been created, and the hive-to-druid job is not happy :(
[20:12:57] MEH :(
[20:14:48] ottomata: Do you think absenting the 2 jobs for now is a correct approach?
[20:21:01] hm, joal the refine job too?
[20:21:04] the refine job is probably fine
[20:21:09] refine is fine
[20:21:16] when will the druid job have the data?
[20:21:17] indeed
[20:21:42] hm, in the next days or so I imagine - when Arzhel opens up the pipes
[20:21:55] oh ok yeah
[20:21:57] then probably
[20:23:14] ottomata: we're gonna get small data at the beginning - getting sflow is not easy with the current network architecture, easier with the new one - so we're gonna get data from Marseille DC first, then the next rows of eqiad with the new archi, and then they're gonna assess how to send more
[20:23:25] ottomata: then probably, we should absent?
[20:23:54] yeah probably
[20:24:05] ack - sending a patch right now - sorry for the mess
[20:24:12] k np
[20:27:01] actually I'll send a patch when I manage to pull the latest version of the puppet repo... Slow internet tonight :(
[20:29:26] joal: k, in meeting but i can prob do it after if you prefer
[20:40:00] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (razzi) @BTullis perhaps you already saw but @Milimetric considered CKAN [on the rubric](https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evalua...
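On the Atlas note above: the import scripts people have written for MySQL/MariaDB typically drive Atlas's v2 REST API directly. A rough sketch of the kind of calls involved, where the host, port and admin:admin credentials are illustrative defaults rather than anything taken from this log:

    # Sanity-check that the Atlas REST API is reachable by listing the type definitions it knows about.
    curl -s -u admin:admin 'http://localhost:21000/api/atlas/v2/types/typedefs' | head -c 500
    # An import script would then create/update entities (e.g. tables pulled out of MariaDB) by POSTing
    # JSON entity payloads to /api/atlas/v2/entity; the payload shape depends on the type model chosen.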
[20:42:44] ottomata: patch is on its way, I'll flag you as reviewer when it finally gets to gerrit
[21:44:47] Analytics-Radar, SRE, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762 (nshahquinn-wmf) Resolved→Declined This was declined rather than resolved.
[22:22:44] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) Been experimenting with how we might create Airflow + Skein + Spark submit integration: Here's an example [[ https://gist.github.com/ottomata/e735c930c9a7f3eff34e874b6651f04f...
[22:57:36] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run Atlas on cloud services cluster - https://phabricator.wikimedia.org/T299166 (razzi) Found a dockerfile for hive: https://github.com/IBM/docker-hive/ Built a docker image: `razzi@data-catalog-evaluation:~/mnt/docker-hive$ docker...