[03:47:41] Tried to update the mediawiki history snapshot, but the `sudo cookbook sre.aqs.roll-restart aqs` command failed for the same reason as Ben's druid cookbook, namely a quote in the "reason" field of the downtime command. Unsure if there's a ticket for tracking this already; fortunately there's a workaround, removing the quote, will do so [04:40:09] Assigned that change to Ben, will roll it out tomorrow [04:40:48] Noticed an alert for disk space on an-airflow1001, hopefully not very urgent since airflow is still in an alpha sort of availability, but I'll see if I can figure out what's going on [05:14:08] Looking at /var/log/auth.log, I see a repeating message `airflow run ...` so I think it's just filling up with normal data that it produces, logs etc [06:16:39] razzi: o/ airflow is important for the search, definitely not alpha, I usually ping dcausse when this happens (to trim logs basically, this is the main problem, there is a task IIRC) [06:17:21] Oh right elukey [06:17:31] also it is very late for you so lemme take care of it :D [06:17:34] Hehe [06:18:06] Was just peeking at alerts, off to bed pretty soon [06:18:13] ack :) [07:19:31] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Escape bucket names for grafana [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/711159 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [07:22:58] (03CR) 10Awight: [V: 03+2] "Repo doesn't self merge, so manually bumping." 
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/711159 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [07:28:45] (03CR) 10Awight: "@milimetric Just wanted to mention, the code was automatically pulled over to production and I was able to read error logs, thanks for hel" [analytics/reportupdater-queries] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578) (owner: 10Awight) [08:00:45] FYI, doing a rolling restart of AQS to pick up the c-ares security update [08:02:12] ack :) [08:03:23] moritzm: qq - what do you use to reboot restbase nodes? We were wondering if it was worth to create a cookbook for it (since we have to do it for AQS) but maybe it is sufficient reboot-single.py [08:07:47] I simply mostly used Cumin; first set downtime for the batch of hosts to be rebooted using the downtime cookbook [08:08:45] and then via cumin "depool; c-foreach-nt-drain; reboot-host" [08:09:05] ah okok, maybe it could be worth to be translated in a cookcook [08:09:10] *cookboot [08:09:12] aaaaahhh [08:09:16] cookbook [08:09:19] haha [08:09:21] (today it is typo day) [08:09:29] yeah, sure. cookbook sounds good for sure [08:09:36] perfect :) [08:09:39] btullis: --^ [08:09:44] John is doing more work on the reboot cookbook framework this Q [08:09:45] (good morning) [08:10:02] and best to already base it on that [08:10:37] for https://phabricator.wikimedia.org/T283067 [08:11:01] makes sense [08:11:01] it'll soon be obsoleted since I'll be uploading the latest Java 8 packages for Buster later today [08:11:13] lol [08:11:36] but with more cookbooks it'll be quicker now :-) [08:12:06] aqs is basically the only part left for analytics, so we are definitely good [08:12:38] ack [09:18:00] Hi, there seems to be some old dump in '/wmf/data/discovery/wikidata/rdf/' from February. These don't conform to recent data (which have a partition called wiki='wikidata' or wiki='commons'). 
Wanted to point out, in case these need cleaning or something. [09:21:38] !log run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full) [09:21:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:28:25] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Continuing the work to complete this zookeeper migration. Currently druid1003 is the leader. The plan is: * Stop relevant systemd timers and suspend relev... [09:31:21] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) +1 :) [09:45:35] !log btullis@an-launcher1002:~$ sudo systemctl disable eventlogging_to_druid_editattemptstep_hourly.timer eventlogging_to_druid_navigationtiming_hourly.timer eventlogging_to_druid_netflow_hourly.timer eventlogging_to_druid_prefupdate_hourly.timer [09:45:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:48:42] btullis: lovely day for another zookeeper swap :D [09:48:50] !log suspended the following oozie jobs in hue: webrequest-druid-hourly-coord, pageview-druid-hourly-coord, edit-hourly-druid-coord [09:48:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:49:39] Yep, this time it's going to go like clockwork, I'm convinced of it :-) [09:51:45] I could get rid of the apostrophe in the cookbook, as razzi did for the AQS cookbook. That might make it a bit quicker/easier. What do you think? I appreciate that the fix is coming from SRE, but the workaround seems OK too. 
[09:59:27] yes yes I 1+ed, no problem from my side [10:03:26] 10Analytics, 10SRE, 10Patch-For-Review: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) OpenJDK 8 needs OpenJDK 8 to build itself, I'm currently building an initial package on my laptop to bootstrap this (and import it to component/jdk8), which will... [10:03:55] 10Analytics, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) [10:06:33] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:09:50] ^ looks like this one will need re-running. [10:11:11] ACKNOWLEDGEMENT - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly Btullis This job will need to be re-run. It was caused by work undertaken during: T255148 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:00:37] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:06:29] Hello, good people of Data Engineering! I have a question related to yarn - I want to estimate the usage of storage for WDQS streaming updater that runs on it. How can I do that? 
[11:13:17] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:13:28] zpapierski: Hi, I would start with something like `hdfs dfs -du -h /wmf/discovery/streaming_updater` from one of the stat boxes. [11:16:53] btullis: thanks, it helps, but I still need to assess the storage used for local storage for specific rocksdb instances. Rocksdb is used per container for each taskmanager [11:17:12] any way to easily get that, other than somehow getting into a container? [11:17:25] However, you might need to use your service user account to be able to access all of the directories. Something like `kerberos-run-command analytics-search hdfs dfs -du -h /wmf/discovery/streaming_updater` [11:17:57] nope, wasn't needed, my kerberos authenticated user account was enough [11:19:05] > local storage for specific rocksdb instances - Hmm. Not sure about this one. I've not looked at container storage yet. Will have a read. [11:19:16] thanks! [11:39:16] (03PS1) 10David Caro: Add database autocompletion [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711456 (https://phabricator.wikimedia.org/T287471) [11:41:23] (03CR) 10jerkins-bot: [V: 04-1] Add database autocompletion [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711456 (https://phabricator.wikimedia.org/T287471) (owner: 10David Caro) [11:44:08] zpapierski: I could be barking up the wrong tree, but if you're looking to find the size of your swift containers for this deployment, I started with: `btullis@thanos-fe1001:~$ swift stat --lh updater` which shows 79GB. [11:46:04] No, this I know how to verify and not really what I need [11:46:05] ...but I'm not certain that I'm looking at the right swift container for your particular task. There are several. These are also literally the first `swift` commands that I've run. 
:-) [11:47:05] That's ok - I figure I can get this data another way (via staging k8s deployment we have) [11:47:13] Thanks for the help :) [11:47:32] You're welcome. [11:54:58] Rolling restart of druid completed. Puppet disabled again on all druid nodes. Patch to switch druid1002 to an-druid1002 prepared. [12:07:29] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:13:53] !log migration of zookeeper from druid1002 to an-druid1002 complete, with quorum and two synced followers. Re-enabling puppet on all druid nodes. [12:13:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:58:51] (03PS2) 10David Caro: Add database autocompletion [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711456 (https://phabricator.wikimedia.org/T287471) [13:13:23] (03CR) 10David Caro: "I'll leave the approval to someone else as I don't know what was agreed in regards to packaging and such. Some comments though." [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [13:32:13] Second rolling restart of druid complete. Preparing third patch to move zookeeper on druid1003 to an-druid1003. [13:46:10] nice! [13:46:33] I just had a chat with Brandon about the LVS on analytics-vlans subject, going to open a task for the options [13:55:01] Ah, great. Would be interested in that. I've used an interesting load-balancer setup recently with LVS running on the back-end hosts themselves, with a floating IP. Might be worth discussing it in the mix as well. 
[16:34:31] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) All three zookeeper servers have been migrated to an-druid100[1-3]. I have re-enabled the systemd timers and resumed the jobs in hue. Now I can think about... [16:37:24] (03CR) 10Bstorm: "I love this approach!" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711456 (https://phabricator.wikimedia.org/T287471) (owner: 10David Caro) [16:40:30] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (10Cmjohnson) [16:41:06] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) 05Open→03Resolved I am going to resolve this task because the relocation is complete. [16:45:22] (03CR) 10Bstorm: Add database autocompletion (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/711456 (https://phabricator.wikimedia.org/T287471) (owner: 10David Caro) [16:56:30] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (10BTullis) a:05razzi→03BTullis [16:59:20] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (10BTullis) p:05Triage→03Medium [18:11:49] (03CR) 10Bstorm: upgrade quarry to python 3.7 (031 comment) [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [18:12:44] (03CR) 10Bstorm: [C: 03+1] "If I46edcb235a6a3382e68632ce648ace242c561430 implements the changes suggested by dcaro and me, this patch will 
be unnecessary." [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711211 (owner: 10Andrew Bogott) [19:21:57] (03CR) 10Michael DiPietro: upgrade quarry to python 3.7 (032 comments) [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:28:19] (03CR) 10Andrew Bogott: upgrade quarry to python 3.7 (031 comment) [analytics/quarry/web] (buster) - 10https://gerrit.wikimedia.org/r/711208 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [19:30:12] !log btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable [19:30:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:37:44] !log btullis@druid1003:~$ sudo systemctl stop druid-broker && sudo systemctl disable druid-broker [19:37:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:40:17] !log btullis@druid1003:~$ sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator [19:40:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:41:10] !log btullis@druid1003:~$ sudo systemctl stop druid-historical && sudo systemctl disable druid-historical [19:41:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:43:40] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) I have disabled the middlemanager on druid1003 with the following command. ` btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/work... 
[19:43:54] !log btullis@druid1003:~$ sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord [19:43:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:03:32] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Interesting. It is clearly moving nodes away from druid1003 but slowly and I can't tell when it will finish. I have discovered that I can use the "dynamic... [20:04:11] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) {F34590841} [20:07:28] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) ` btullis@an-druid1001:~$ curl -s http://localhost:8081/druid/coordinator/v1/config|jq . { "millisToWaitBeforeDeleting": 900000, "mergeBytesLimit": 524... [20:08:51] 10Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10Iflorez) [20:10:37] 10Analytics, 10Product-Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10mpopov) p:05Triage→03High [20:11:11] 10Analytics, 10Product-Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10mpopov) [20:12:35] (03PS9) 10Michael DiPietro: add stop query function [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/710067 (https://phabricator.wikimedia.org/T71037) [20:13:50] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis) Gracefully terminated the two remaining middlemanagers. ` btullis@druid1001:~$ curl -X POST http://druid1001.eqiad.wmnet:8091/druid/worker/v1/disable && cu... 
[20:25:26] 10Analytics, 10Code-Health-Objective, 10Epic, 10Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Eevans) [20:25:34] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10BTullis) a:03BTullis Hi Mikhail, I believe that I can help you with this. Before changing any permissions, please can you let me know what you're trying to... [20:29:34] 10Analytics, 10Analytics-Wikistats: Wikistats 2.0 Remaining reports. - https://phabricator.wikimedia.org/T186121 (10odimitrijevic) [20:29:38] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120 (10odimitrijevic) [20:30:46] 10Analytics, 10Platform Engineering Roadmap, 10User-Eevans: Create k8s deployment of AQS 2.0 - https://phabricator.wikimedia.org/T288661 (10Eevans) [20:31:39] 10Analytics, 10Platform Engineering Roadmap, 10User-Eevans: Obtain a security review of AQS 2.0 - https://phabricator.wikimedia.org/T288663 (10Eevans) [20:33:06] 10Analytics, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10Eevans) [20:35:05] 10Analytics, 10Code-Health-Objective, 10Platform Engineering Roadmap, 10User-Eevans: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (10Eevans) [20:54:24] 10Analytics, 10Analytics-Wikistats: Add an option to export the current graph into image file - https://phabricator.wikimedia.org/T219969 (10odimitrijevic) [20:54:27] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0. 
- https://phabricator.wikimedia.org/T130256 (10odimitrijevic) [20:57:18] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10BTullis) I see from the [[ https://app.slack.com/client/T024KLHS4/CSV483812/thread/CSV483812-1628712975.032500 | Slack thread ]] that the key issue is the time... [21:08:39] 10Analytics, 10Analytics-Wikistats: Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics - https://phabricator.wikimedia.org/T120037 (10odimitrijevic) [21:08:41] 10Analytics, 10Analytics-Wikistats: Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days - https://phabricator.wikimedia.org/T120036 (10odimitrijevic) [21:09:15] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0. - https://phabricator.wikimedia.org/T130256 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic Closing as a parent task in favor of using project tags. Epic tasks can serve as parent tasks when needed to capture large feature work. [21:16:04] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: write permission for analytics-privatedata-users - https://phabricator.wikimedia.org/T288657 (10mpopov) Hi @BTullis! o/ @Mayakp.wiki was working with @Ottomata on this to change ownership before he went on vacation and there was a slight misunderstanding... [21:17:13] 10Analytics: Remove support for the (deprecated) Druid datasources (in favor of Druid Tables) on Superset - https://phabricator.wikimedia.org/T263972 (10odimitrijevic) @elukey is this task complete? 
[21:17:45] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Change /user/hive/warehouse/wmf_product.db ownership to iflorez - https://phabricator.wikimedia.org/T288657 (10mpopov) [21:19:23] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Tracking-Neverending: Superset Updates - https://phabricator.wikimedia.org/T211706 (10odimitrijevic) [21:19:25] 10Analytics, 10 Data-Engineering, 10Better Use Of Data, 10Data-Engineering-Kanban, 10Product-Analytics: Upgrade Superset to 1.2 - https://phabricator.wikimedia.org/T288115 (10odimitrijevic) [21:20:08] 10Analytics, 10 Data-Engineering, 10Better Use Of Data, 10Data-Engineering-Kanban, 10Product-Analytics: Upgrade Superset to 1.2 - https://phabricator.wikimedia.org/T288115 (10odimitrijevic) [21:20:11] 10Analytics-Clusters, 10Analytics-Kanban: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (10odimitrijevic) [21:20:13] 10Analytics-Clusters, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10odimitrijevic) [21:27:31] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Change /user/hive/warehouse/wmf_product.db ownership to iflorez - https://phabricator.wikimedia.org/T288657 (10BTullis) I have changed the ownership of the directories as requested. ` btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs... [21:29:30] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Change /user/hive/warehouse/wmf_product.db ownership to iflorez - https://phabricator.wikimedia.org/T288657 (10BTullis) For confirmation: ` btullis@an-launcher1002:~$ hdfs dfs -ls -d /user/hive/warehouse/wmf_product.db/gs_pageviews_corrected /user/hive/wa... [21:42:12] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Change /user/hive/warehouse/wmf_product.db ownership to iflorez - https://phabricator.wikimedia.org/T288657 (10mpopov) 05Open→03Resolved Thank you very much! We really appreciate how quickly this was resolved. 
[22:02:44] 10Analytics-Clusters, 10Analytics-Radar, 10SRE, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10odimitrijevic) @herron can this task be closed out and possibly create a new cleanup the old hosts if this work still needs to be done? [22:42:28] 10Analytics, 10 Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (10odimitrijevic)