[07:37:42] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Spark3 migration - Currently existing airflow jobs - https://phabricator.wikimedia.org/T306955 (10JAllemandou) [07:37:59] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Migrate Cassandra pageview-per-project-hourly Job - https://phabricator.wikimedia.org/T307935 (10JAllemandou) [07:38:32] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Spark 3 Migration - https://phabricator.wikimedia.org/T309993 (10JAllemandou) [07:38:56] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/804574 (owner: 10Joal) [07:41:51] (03CR) 10Joal: Improve efficiency 2x by not looking at upload (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/804429 (owner: 10Milimetric) [07:43:53] Ok - launching a refinery deploy to then deploy airflow [07:45:13] !log deploy refinery using scap [07:45:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:02:13] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2003.codfw.wmnet with OS buster [08:05:51] 10Data-Engineering: Update webrequest error thresholds - https://phabricator.wikimedia.org/T310576 (10JAllemandou) [08:30:51] 10Data-Engineering: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10JAllemandou) [08:31:01] btullis: Hi - would you be nearby? [08:31:15] Yes, I'm right here. [08:31:20] Great :) [08:31:35] I need some help - A scap deploy of refinery has failed :S [08:31:59] I think it must have been due to some archiva failure or something [08:32:10] OK, how can I best help? Batcave? [08:32:25] I'd like to try again, but I think we need to clear up the already-pulled cache [08:32:47] Joining [08:40:10] !log Deploying using scap again after failure cleanup on an-launcher1002 [08:40:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:41:43] At some point today I'd like to run the `sre.hadoop.roll-restart-masters` cookbook to test the failover process as a result of T310293 - Any objections to my doing so, or any preferences as to when it is done? [08:41:43] T310293: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 [08:42:25] btullis: thanks for asking :) when you wish is ok for me [08:45:17] btullis: second deploy succeeded - Must have been a glitch - thanks again [08:47:20] Great, thanks for the update joal. [08:48:41] !log roll-restarting hadoop masters T310293 [08:48:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:48:44] T310293: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 [08:56:18] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2003.codfw.wmnet with OS buster completed: - aqs2003 (**WARN... [08:56:37] Oh dear, look at this terrible typo that I did. Slipped past three of us in code review. [08:56:39] https://usercontent.irccloud-cdn.com/file/brmPr3If/image.png [08:57:00] wow indeed btullis! [08:57:12] good catch!! [09:01:00] heya teammm good morning :] [09:01:11] joal: are you doing airflow deploys? 
:] [09:01:19] Good morning mforns :) [09:01:30] mforns: I've started by a refinery deploy - almost done [09:01:35] I'll foloow with the airflow ones [09:01:39] ok [09:01:48] Hello mforns \o/ [09:01:51] But I have a question first mforns - Shall I change the start_date of dags I rename? [09:01:53] hi btullis :] [09:02:43] joal: you can choose to change the start date in the code and merge, or create an Airflow variable in the UI, overriding the start_date [09:02:59] yeah - is there a prefered way? [09:04:07] btullis: BTW, wanted to ask about a script I'm working on. It's the Airflow development instance script (spins up an airflow instance for dev). My question is: Is it OK (is it an accepted practice in SRE) to add a kill command within such a script? [09:04:24] joal: no, the Variable is just a convenience so that you don't need to redeploy. [09:04:33] joal: you choose? [09:04:47] joal: *you choose! [09:05:05] mforns: discussing this with aqu as well [09:05:49] btullis: the script kills all the Airflow subprocesses when the user types ctrl-C [09:08:04] joal: maybe in the case where the DAG is renamed, since we loose all the DAG history, it might be better to change the start_date in the code, just in case we loose the Variable at some point, and the DAG tries to execute all its history since start_date again... Also maybe just for pure congruence of the code start_date and the start of the DAG's history in the metada? [09:08:04] mforns: Yes I think it's fine. `kill` is just a way of saying `send a signal` to a process. As long as the process is owned by the person running the script, there's no problem. You can send it the default signal 15 which is TERM, or you can say kill -9 which is KILL. Or you can send it different signals which are handled by the script or the binary in different ways. [09:08:37] mforns: that's where we landed with aqu as well - with provide a PR with date-changes soon [09:08:40] thanks mforns [09:08:46] I see btullis thanks! :] [09:08:52] ok joal [09:13:36] A pleasure. Here is a useful list of signals https://en.wikipedia.org/wiki/Signal_(IPC)#Default_action - There's ahandy reference to trapping signals and tidying up processes in bash here https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html (but maybe that's too basic for your needs. [09:13:40] ) [09:16:15] btullis: thank you a lot!! [09:17:20] joal: Is it OK if I also merge Dan's DataHub ingestion Airflow DAG for your deployment? We can leave it paused, and let Dan switch it on, when he arrives. [09:25:21] no problem for me mforns [09:25:28] k [09:25:32] merging [09:26:36] mforns: still talking with Antoine, will change dates soon [09:26:43] no problemo [09:29:19] mforns: what about SandraEbele's PR on Airflow? do you know if it's ready? [09:29:50] I will review it! [09:32:16] FYI, I'M going to merge a patch (https://gerrit.wikimedia.org/r/805197), which will trigger restarts of airflow services, no other impact expected [09:36:22] moritzm: Thanks for the heads-up. I don't think that this is likely to affect any DAGs, with a momentary restart of the scheduler and webserver. Would you agree mforns? 
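A minimal sketch of the override pattern mforns describes above — an Airflow Variable set from the UI taking precedence over the start_date coded in the DAG file. The variable key, DAG id, schedule and dates are illustrative placeholders, not the real airflow-dags configuration:

```python
# Sketch only: let an optional Airflow Variable override the coded start_date,
# falling back to the value in the DAG file when no override has been set.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable

default_start = "2022-06-15"  # the date that would normally live in the DAG code
start_date_str = Variable.get("example_dag_start_date_override", default_var=default_start)

dag = DAG(
    dag_id="example_renamed_dag",
    start_date=datetime.fromisoformat(start_date_str),
    schedule_interval="@hourly",
    catchup=True,  # a renamed DAG has no history, so it backfills from start_date
)
```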
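The signal handling btullis outlines for the dev-instance script — trap the interrupt, ask children to exit with TERM (15), escalate to KILL (9) only as a last resort — looks roughly like the following when sketched in Python for illustration (the real run_dev_instance.sh is a shell script using `trap`; the child commands below are placeholders):

```python
# Illustrative trap-and-tidy-up pattern: catch Ctrl-C, terminate child
# processes politely, and only kill them if they refuse to exit.
import signal
import subprocess
import sys

children = []

def cleanup(signum, frame):
    for proc in children:
        if proc.poll() is None:      # still running
            proc.terminate()         # SIGTERM (15): ask politely
    for proc in children:
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()              # SIGKILL (9): last resort
    sys.exit(0)

signal.signal(signal.SIGINT, cleanup)   # Ctrl-C
signal.signal(signal.SIGTERM, cleanup)  # e.g. `kill <pid>` with the default signal

# Placeholder children standing in for the scheduler/webserver the script starts.
children.append(subprocess.Popen(["sleep", "300"]))
children.append(subprocess.Popen(["sleep", "300"]))

signal.pause()  # wait until a signal arrives
```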
[09:36:36] !log deploy refinery onto HDFS [09:36:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:36:45] all done already :-) [09:37:19] btullis: usually it's fine, if there are tasks currently running, they might fail, but we can restart them easily [09:58:25] mforns, aqu : https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/89 [10:03:20] joal: approved! thanks :] [10:03:51] Ok - merging after pipeline, and then deploy! [10:04:10] 10Data-Engineering: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10Antoine_Quhen) Eventually, we may point directly to the local assembly jars located in the workers. Let's try to test this configuration on the test cluster. [10:04:26] 10Data-Engineering: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10Antoine_Quhen) a:03Antoine_Quhen [10:05:57] I plan to merge this change today: https://gerrit.wikimedia.org/r/c/operations/puppet/+/804593 - This will cause Icinga to notify us about errors coming from our hosts, rather than have them go to #wikimedia-operations alone. Please be on the lookout for any unexpected behaviour from Icinga. [10:06:28] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2004.codfw.wmnet with OS buster [10:08:49] 10Data-Engineering, 10Data-Catalog, 10SRE, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) OK @JMeybohm I've created three CRs that I think should do what we need to finish this. * Adding CNAME records to DNS * Adding service catalog entries... [10:12:30] joal, aqu: I think it would be cool to also deploy the fix for run_dev_instance.sh, since it's broken right now... [10:12:40] !log manually failing back hdfs-namenode to an-master1001 after fixing typo [10:12:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:12:57] Plus, it's needed to test Dan's job, and any other job that uses Skein, or deploy-mode=cluster. [10:13:09] joal aqu: ^^ [10:13:26] The MR is quite long though... [10:13:44] maybe aqu, since you already know that script, can you please review? [10:14:01] We can do an extra deployment after yours joal, no need to wait [10:14:28] mforns: no problem for me - itll take time for me to review! [10:14:56] mforns: I assume you've tested it :) [10:17:49] Ugh, the manual failback of the namenode still failed for some reason. Continuing to investigate, but it's still running happily on an-master1002. [10:17:52] https://www.irccloud.com/pastebin/W6FFd0zm/ [10:20:09] joal: I've tested it [10:20:22] joal: but we don't need to postpone other deploys for this review. [10:20:32] we can merge later in the day [10:20:37] ack mforns - let's wait for aqu review then [10:20:49] :th [10:20:52] :+1 [10:20:55] argh [10:20:59] 👍 [10:21:02] mwarf btullis [10:21:12] btullis: have you updated the waiting time of your cookbook? [10:26:10] mforns: we're facing many pipeline errors due to the infra :( [10:26:17] I'm gonna ping hashar [10:26:33] pong joal [10:26:38] hi! [10:27:01] hashar: we're facing regular pipeline failures on gitlab lately due to disk-space issues [10:27:11] oh joy [10:27:15] hashar: do you have any related knowledge? 
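For the spark3 assembly task (T310578) mentioned above, the two standard ways to tell Spark-on-YARN where its jars live are an assembly archive localized from HDFS, or jars already installed on every worker — the latter being what the Phabricator comment suggests testing. A hedged sketch, with placeholder paths rather than the cluster's real ones:

```python
# Sketch of the two Spark-on-YARN jar-distribution options; paths are made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark3-assembly-smoke-test")
    # Option 1: a pre-built assembly archive on HDFS, localized by YARN.
    .config("spark.yarn.archive", "hdfs:///user/spark/share/lib/spark-3.1-assembly.zip")
    # Option 2 (the "local assembly jars on the workers" idea): jars already
    # installed on every node, referenced with a local: URI.
    # .config("spark.yarn.jars", "local:/usr/lib/spark3/jars/*")
    .getOrCreate()
)

spark.range(10).count()  # trivial job to confirm executors start with the configured jars
spark.stop()
```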
[10:27:26] we should poke #wikimedia-gitlab [10:27:41] I didn't know the chan [10:27:44] wil do [10:43:02] Ok - last MR merged - Will deploy now [10:43:08] Brace yourself! :) [10:43:35] mforns: open question: will new dags show up in "pause" mode? [10:44:41] !log Deploy Airflow [10:44:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:45:16] I have my answer: they show paused :) [10:46:46] First: I confirm the DAG parsing problem is solved for now :) All dags got parsed super fast! [10:48:34] !log unpause renamed dags [10:48:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:59:22] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:00:45] joal: yes, new dags show up in paused mode [11:01:04] joal: woohoo fast parsing! [11:01:34] btullis: Joseph told me you have some interest in distributed file systems? Do we have anything on our setup beside Swift :) [11:02:19] the context was Gitlab runner running on WMCS filing their disk space due to some docker volume. So maybe we could offload them to a distributed filesystem [11:02:31] then well WMCS has Ceph so I guess it is already solved :] [11:02:59] I don't think I wanna setup a custom distributed file system [11:03:18] that was merely a brain dump, not much to follow up on I guess [11:17:34] 10Data-Engineering, 10Data-Catalog, 10SRE, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) Cool, thanks! +1ed the first two. The service::catalog entries should be in stage production before switching trafficserver to the discovery record ju... [12:26:29] PROBLEM - AQS root url on aqs2004 is CRITICAL: connect to address 10.192.0.212 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:37:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10Observability-Alerting: Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga - https://phabricator.wikimedia.org/T310359 (10BTullis) This change is now merged and I have tested that it has resulted in both hosts... [12:37:49] 10Data-Engineering, 10Data-Engineering-Kanban: Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga - https://phabricator.wikimedia.org/T310359 (10BTullis) [12:38:21] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2005.codfw.wmnet with OS buster [12:39:59] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2006.codfw.wmnet with OS buster [12:40:18] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2004.codfw.wmnet with OS buster completed: - aqs2004 (**WARN... 
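On joal's question above about renamed and new DAGs arriving paused: stock Airflow pauses newly created DAGs when `dags_are_paused_at_creation` is true (the default), and the behaviour can also be pinned per DAG. A small sketch with a made-up DAG id:

```python
# Sketch only: whether a freshly deployed DAG starts paused is governed by
# [core] dags_are_paused_at_creation in airflow.cfg, and can be set per DAG.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="example_new_dag",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    is_paused_upon_creation=True,  # explicit, matching the behaviour observed above
)
```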
[12:40:40] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2007.codfw.wmnet with OS buster [12:41:48] hashar: thanks for reaching out. Lots to unpick from your questions above and I'm happy to talk about it any time. [12:41:55] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2008.codfw.wmnet with OS buster [12:42:45] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2009.codfw.wmnet with OS buster [12:45:59] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2010.codfw.wmnet with OS buster [12:47:03] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2011.codfw.wmnet with OS buster [12:47:48] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host aqs2012.codfw.wmnet with OS buster [12:51:29] So, with these gitlab-runners in WMCS, the issue is that they need a file system which looks local to the VM. They're running docker, probably with the `overlay2` storage driver, which needs an `ext4` backing volume. https://docs.docker.com/storage/storagedriver/select-storage-driver/#supported-backing-filesystems [12:52:01] mforns: review done :) [12:56:25] You're right that the WMCS setup uses Ceph to provide these virtual hard drives to the VMs, so it looks like a local disk to the VM but the hypervisor translates this into a request for a *RADOS Block Device* over the network. Those requests are distributed across all Ceph nodes in the WMCS cluster. I think that it is possible to resize these virtual hard drives, but I 've never done it myself on this cluster. I'd start here: [12:56:25] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Resizing_instances [12:59:42] I'm working on a spec for a new Ceph cluster: https://docs.google.com/document/d/1dhAlABcM08zMcw9u01qwukhnw2bf6jQ9rKsRkuRRjdQ/edit#heading=h.xup1vq28kzqd [12:59:42] ...and the idea behind this cluster is to provide *both* block storage like the WMCS Ceph cluster does and HTTP based object storage like the Swift cluster does. It's at an early stage pf design at the moment though, so it's probably six months or so from being usable. [13:06:09] ACKNOWLEDGEMENT - MD RAID on aqs2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.42. 
Check system logs on 10.192.16.42 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310610 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:08:16] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) I have verified that we will now get notified via #wikimedia-analytics and email to analytics-alerts@... [13:08:53] PROBLEM - puppet last run on aqs2007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.169: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:09:49] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) [13:13:50] PROBLEM - dhclient process on aqs2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.189: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:14:13] RECOVERY - puppet last run on aqs2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:15:17] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) > [] Possibly separate image backup storage from namenode data storage partitions This is a bit tric... [13:16:43] PROBLEM - puppet last run on aqs2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.189: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:17:31] PROBLEM - AQS root url on aqs2007 is CRITICAL: connect to address 10.192.16.169 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [13:21:38] PROBLEM - Host aqs2012 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:22] RECOVERY - puppet last run on aqs2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:22:24] RECOVERY - Host aqs2012 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [13:37:23] --^ I've just added 48 hours of downtime for these aqs2* hosts - Discussion in #wikimedia-data-persistence [13:43:37] btullis: I did try resizing a WMCS volume backed up by Cinder/Ceph and it worked as far as I remember :] [13:44:42] btullis: so we can easily resize the /var/lib/docker by either resizing the Cinder/Ceph volume or removing the old one and create a new larger one [13:44:54] RECOVERY - dhclient process on aqs2012 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [13:45:18] what would be amazing is to have Docker to have native support for Ceph and talk to it directly but it is probably a whole differnet topic :) [13:46:35] meanwhile for production services, I definitely have a use case to have a distributed file system (rather than the Swift interface). 
doc.wikimedia.org is an example, it has all the files stored on disk which make it a pain to move the service to another host and I don't think we have any storage on our k8s short of rewriting all of our code to use Swift instead [13:46:36] hashar, that's good that you've been able to resize. I've also been looking at different options for having gitlab-runners automatically perform housekeeping on the docker file system. Stuff like this: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2980 and https://gitlab.com/gitlab-org/gitlab-runner-docker-cleanup [13:47:53] the volumes left behind had a "cache" in their name, so maybe that is used to restore some installed materials between builds. I will raise it in our weekly gitlab syncup meeting [13:47:58] +1 on doing routine cleanup [13:50:07] For the doc.wikimedia.org use case it might make sense for you to add the requirement to somewhere like this: https://phabricator.wikimedia.org/T264291 - It kind of depends on whether you need local file sytem semantics, or whether it would translate to a S3/Swift storage model easily. [13:53:54] btullis: yeah that one definitely requires system semantic short of rewriting our app to be S3 aware :] [13:54:57] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2005.codfw.wmnet with OS buster completed: - aqs2005 (**WARN... [13:59:45] thanks for the merge, mforns! [13:59:48] btullis: I will read the design doc :) thank you! [14:09:01] joal: whenever you want to talk sqoop stuff, I'm around [14:10:37] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10MatthewVernon) It'll take the cookbooks a while to catch up (they back-off in increasing intervals waiting for puppet to be OK), but after some deployment-rel... [14:10:45] 10Analytics-Radar, 10Domains, 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Nemo_bis) Is this related to https://phabricator.wikimedia.org/T255366 ? [14:10:51] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2006.codfw.wmnet with OS buster completed: - aqs2006 (**WARN... [14:12:49] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2007.codfw.wmnet with OS buster completed: - aqs2007 (**WARN... [14:14:42] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2008.codfw.wmnet with OS buster completed: - aqs2008 (**WARN... 
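One way to script the routine cleanup discussed above for the runners is the Docker SDK for Python, assuming it is installed on the runner and the job can reach the Docker socket; a cron'd `docker system prune` is the simpler equivalent. A sketch, not the gitlab-runner-docker-cleanup tool itself:

```python
# Remove unused Docker resources left behind by CI builds.
import docker

client = docker.from_env()

client.containers.prune()  # stopped containers
client.images.prune()      # dangling images (the default filter)
client.volumes.prune()     # unused volumes, e.g. leftover build caches
client.networks.prune()    # unused networks
```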
[14:15:31] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2011.codfw.wmnet with OS buster completed: - aqs2011 (**WARN... [14:16:41] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2009.codfw.wmnet with OS buster completed: - aqs2009 (**WARN... [14:18:37] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2010.codfw.wmnet with OS buster completed: - aqs2010 (**WARN... [14:20:49] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host aqs2012.codfw.wmnet with OS buster completed: - aqs2012 (**WARN... [14:23:42] ottomata: this is the DataHub Kafka emitter, was this what you saw or something else that's even easier? https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/as-a-library.md#kafka-emitter [14:27:31] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10Eevans) [14:29:45] 10Data-Engineering, 10Data-Engineering-Kanban, 10Discovery, 10Generated Data Platform, 10Patch-For-Review: Agree on and adopt WMF scalastyle conventions - https://phabricator.wikimedia.org/T310143 (10Ottomata) [14:30:33] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10Eevans) [14:32:19] milimetric: Hey that kafka-emitter looks great! [14:33:47] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10Eevans) [14:36:24] (03CR) 10Btullis: [C: 03+2] Release v0.8.38 of DataHub using WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/804611 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [14:37:46] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10Eevans) [14:37:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Discovery, 10Generated Data Platform, 10Patch-For-Review: Agree on and adopt WMF scalastyle conventions - https://phabricator.wikimedia.org/T310143 (10Ottomata) I don't see how to enforce the others via a maven plugin, but we could publish and use a [[ htt... 
[14:38:16] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) [14:38:18] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (10Eevans) 05Open→03Resolved a:03Eevans [14:38:37] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) [14:39:02] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) These are the two remaining things for now: * `hdfs dfsadmin -fetchImage` should have kept failing a... [14:39:38] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) p:05Triage→03Medium [14:43:48] milimetric: btullis, yes I think so! [14:44:12] https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub#datahub-kafka [14:44:21] https://datahubproject.io/docs/architecture/metadata-ingestion/ [14:44:31] although, it'd be much nicer if the sources, e.g. hive metastore, pushed themselves [14:44:46] i'm sure there is a nice way to do that with some kind of metastore plugin? but we'd have to write it [14:47:54] (03CR) 10Btullis: [V: 03+2 C: 03+2] Release v0.8.38 of DataHub using WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/804611 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [14:48:48] I think what we REALLY want it so implement a https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/metastore/MetaStoreEventListener.html [14:48:56] that send changes to datahub via kafka [14:49:00] https://towardsdatascience.com/apache-hive-hooks-and-metastore-listeners-a-tale-of-your-metadata-903b751ee99f [14:49:10] milimetric: ^ [14:53:57] (03Merged) 10jenkins-bot: Release v0.8.38 of DataHub using WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/804611 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [14:58:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (10BTullis) [15:04:20] ottomata: yeah, I agree, that's what I was saying is slightly more complicated. It's nice, but I feel like this catalog is in such early days that people should first get used to the idea and start gathering in this common space before we get too sophisticated. If we go too far too soon, we might lose folks [15:04:27] ottomata: is this ok to merge now? https://gerrit.wikimedia.org/r/c/analytics/refinery/+/792215 [15:07:42] btullis: added the CLI build info to https://wikitech.wikimedia.org/wiki/Analytics/Systems/DataHub/Upgrading [15:10:03] milimetric: Great, thanks. We've got the production containers built for 0.8.38 now, so we could either press on to get 0.8.38 released today, along with the client update... [15:10:28] ...or we could let the job run this evening and spend a bit more time upgrading both tomorrow. 
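Following the Kafka emitter discussion above, here is a hedged sketch of pushing a metadata change to DataHub over Kafka as a library call. Class and method names follow the linked as-a-library page, but should be checked against the datahub CLI version pinned in refinery; brokers, schema registry and the example dataset are placeholders. ottomata's MetaStoreEventListener idea — a Java hook on the Hive metastore that produces these events as tables change — is the push-from-source variant and is not sketched here.

```python
# Sketch: emit a dataset metadata change proposal to DataHub via Kafka.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "kafka-broker.example.org:9092",
                "schema_registry_url": "http://schema-registry.example.org:8081",
            }
        }
    )
)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Example table registered via the Kafka emitter"),
)

emitter.emit_mcp(mcp, callback=lambda err, msg: print(err or "queued"))
emitter.flush()
```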
[15:19:54] 10Data-Engineering, 10Data-Engineering-Kanban, 10Discovery, 10Generated Data Platform, 10Patch-For-Review: Agree on and adopt WMF scalastyle conventions - https://phabricator.wikimedia.org/T310143 (10gmodena) > I don't see how to enforce the others via a maven plugin, but we could publish and use a scala... [15:29:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog: Update branding for DataHub to include WMF customizations - https://phabricator.wikimedia.org/T310629 (10BTullis) [15:32:30] (03PS1) 10Btullis: Update branding for DataHub to include WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/805408 (https://phabricator.wikimedia.org/T310629) [15:36:04] (03PS2) 10Btullis: Update branding for DataHub to include WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/805408 (https://phabricator.wikimedia.org/T310629) [15:36:30] btullis: you can upgrade, I think the old CLI might work with the new version, but not the other way around. [15:39:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Update branding for DataHub to include WMF customizations - https://phabricator.wikimedia.org/T310629 (10BTullis) I have created a patch to implement this suggestion: https://gerrit.wikimedia.org/r/c/analytics/datahub/+/8054... [15:40:40] milimetric: I think I'll aim to do it tomorrow morning UK time, if that's OK with you. Otherwise I'll run out of time before meetings today. [15:42:16] Maybe I could make a patch like this today, so you could check it? https://gerrit.wikimedia.org/r/c/analytics/refinery/+/792215/3/packaged-environments/datahub-cli/setup.cfg [15:42:52] I think I'd rather try to update the server and client around about the same time, if possible. Reduce the chances of weird errors, hopefully. [15:43:43] So maybe I should aim to do it when you and I cross over tomorrow? [15:46:23] sounds good, I'm for it [16:06:09] (03CR) 10Snwachukwu: [WIP] Add projectview hql scripts to analytics/refinery/hql path. (038 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: 10Snwachukwu) [16:12:57] 10Data-Engineering, 10Data-Engineering-Kanban: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10Antoine_Quhen) [16:13:34] (03PS3) 10Snwachukwu: [WIP] Add projectview hql scripts to analytics/refinery/hql path. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) [16:17:17] (03PS1) 10MewOphaswongse: Add other_reason action_data to image_suggestion_interaction and link_suggestion_interaction schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/805418 (https://phabricator.wikimedia.org/T304099) [16:28:51] (03CR) 10Ottomata: [C: 03+1] Add datahub metadata ingestion CLI as a conda env [analytics/refinery] - 10https://gerrit.wikimedia.org/r/792215 (https://phabricator.wikimedia.org/T307714) (owner: 10Milimetric) [16:36:38] (03CR) 10Milimetric: "Thanks! I'll merge this tomorrow after I bump the version again, after Ben's upgrade." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/792215 (https://phabricator.wikimedia.org/T307714) (owner: 10Milimetric) [16:40:25] (03PS1) 10Milimetric: Update mediawiki history pipeline [analytics/refinery] - 10https://gerrit.wikimedia.org/r/805446 (https://phabricator.wikimedia.org/T309987) [16:46:59] 10Data-Engineering, 10Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10JAllemandou) The gobblin problem is a known issue - we have setup alerts (that worked!) that cover us from this. [17:24:35] 10Data-Engineering, 10Data-Engineering-Kanban: Build Bigtop 1.5 Hadoop packages for Bullseye - https://phabricator.wikimedia.org/T310643 (10BTullis) [17:24:55] 10Data-Engineering, 10Data-Engineering-Kanban: Build Bigtop 1.5 Hadoop packages for Bullseye - https://phabricator.wikimedia.org/T310643 (10BTullis) p:05Triage→03Medium [17:33:16] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:59] ok team - ending my day for today - see you tomorrow [18:31:47] 10Data-Engineering, 10Airflow: SparkSubmitOperator should make it easier to use conda dist envs - https://phabricator.wikimedia.org/T307937 (10Ottomata) Since @mforns is working on making `run_dev_instance.sh` support keytabs, I'm okay with merging. See https://gitlab.wikimedia.org/repos/data-engineering/airf... [20:40:45] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Generated Data Platform, 10Patch-For-Review: Add better support for using Event Platform streams with the Flink DataStream API - https://phabricator.wikimedia.org/T310302 (10Ottomata) Looking into how to use the TypeInformation we ca... [21:20:46] (03PS1) 10Milimetric: Add scatter and bar charts [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/805481 [23:40:27] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
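Regarding T307937 above (SparkSubmitOperator and conda dist envs): the underlying spark-submit pattern is to ship a packed conda env with --archives and point PYSPARK_PYTHON inside the unpacked directory. A generic sketch using the stock Spark provider operator — not the custom SparkSubmitOperator in airflow-dags, whose interface differs — with placeholder paths and connection id:

```python
# Sketch: run a PySpark job on YARN using a packed conda env shipped as an archive.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_conda_env_dag",
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="example_conda_env_job",
        conn_id="spark_default",
        application="hdfs:///user/example/jobs/example_job.py",
        # YARN unpacks the archive next to each container as ./environment
        archives="hdfs:///user/example/envs/example_env.conda.tgz#environment",
        conf={
            # cluster deploy-mode: driver (AM) and executors both use the shipped env
            "spark.submit.deployMode": "cluster",
            "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
            "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python",
        },
    )
```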