[00:06:38] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:20:44] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:54:21] Quarry, Cloud-Services: Quarry links to IRC are broken - https://phabricator.wikimedia.org/T283773 (Legoktm) @Framawiki, @zhuyifei1999, @bstorm, could one of you deploy the above change?
[04:32:22] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi thanks - let me know when I can proceed
[05:28:28] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `db1183.eqiad.wmnet` - db1183.eqiad.wmnet (**PASS**) - Downtimed...
[06:24:23] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` db1183.eqiad.wmnet ` The log can be found in `/var/log...
[06:29:32] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1183.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1183.eqiad.wmnet'] `
[06:39:03] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts: ` dbstore1007.eqiad.wmnet ` The log can be found in `/va...
[07:01:54] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbstore1007.eqiad.wmnet'] ` and were **ALL** successful.
[07:12:27] razzi: re: increase memory/cpu for ganeti vms - it can be done afterwards if needed (but requires a vm reboot etc..)
[07:16:52] ottomata: re: launcher + airflow, I have some doubts, for example how airflow will deal with the big workload that sqoop generates every first week of the month (regular timers are ok, so maybe airflow will be too). Having it on an-coord could be another option, it would also have a natural failover if needed (probably manual puppet change to an-coord1002) but a VM is also good (so we'll get a
[07:16:58] dedicated stack as the other teams, etc..). It is also easy to start from launcher and move to some other place if needed, so you folks decide :)
[07:27:46] Analytics, Analytics-Kanban, Product-Analytics, SRE, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (schoenbaechler) >>! In T283190#7119438, @Marostegui wrote: > @schoenbaechler can you please confirm you've read and s...
[07:44:42] * elukey afk again
[08:05:48] Analytics: Reportupdater SQL jobs failing with Python error - https://phabricator.wikimedia.org/T284074 (awight)
[09:07:43] Quarry, Cloud-Services: Quarry links to IRC are broken - https://phabricator.wikimedia.org/T283773 (Framawiki) Deployed, thanks for the patch.
[09:08:34] Quarry, Cloud-Services: Quarry links to IRC are broken - https://phabricator.wikimedia.org/T283773 (Framawiki) Open→Resolved
[10:51:15] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Thanks @razzi - could you or @elukey let me know if I can stop this host? (Given it is the start of the month, not sure if it is being used)
[12:59:35] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) No go for this week :(
[13:06:19] elukey: I think airflow is going to be super easy to move around, and I like that an-launcher would have it sit next to all the other timers, etc.
[13:07:04] if the sqoop workload is a problem, we can move it, but I don't think airflow is going to have a lot of overhead, esp not at first
[13:07:22] elukey: if you are there, can we chat about analytics-meta databases and replication?
[13:07:45] i read the ticket about ROW based, and the docs, but am not totally sure what is needed to create a new db on an-coord1001
[13:18:32] ottomata: it is public holiday for me, I'll attend the mysql training later on but we can chat tomorrow if you are ok
[13:19:13] but if it is quick we can do it now in here
[13:19:32] I left some notes in the gerrit patch about replication etc..
[13:19:55] if you need to create a new db on an-coord1001, it will be nicely replicated elsewhere without any effort
[13:20:09] but I think that users/grants need to be set on 1002 and db1108
[13:20:28] (otherwise replication might break in some use cases)
[13:23:09] oh!
[13:23:33] ok thanks elukey that's mostly what I was wondering, so creating db is good, just have to do the grants everywhere
[13:23:42] elukey: have a good holiday!
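The split described above (create the database once on an-coord1001 and let replication carry it, but set users/grants on every host) can be sketched as a small statement planner. This is only an illustration under assumptions: the database name, user name, host pattern and password placeholder are hypothetical, and the real procedure is whatever the notes in the gerrit patch say.

```python
# Hosts named in the discussion; everything else below is hypothetical.
PRIMARY = "an-coord1001"
REPLICAS = ["an-coord1002", "db1108"]


def plan_statements(db_name, user, host_pattern="%"):
    """Return {host: [sql, ...]} for adding a new database.

    CREATE DATABASE is planned only for the primary, since it is
    expected to replicate. User and grant statements are planned for
    every host, following the advice that grants must be set everywhere.
    """
    grants = [
        f"CREATE USER IF NOT EXISTS '{user}'@'{host_pattern}' "
        "IDENTIFIED BY '<password>';",
        f"GRANT ALL PRIVILEGES ON `{db_name}`.* TO '{user}'@'{host_pattern}';",
    ]
    plan = {PRIMARY: [f"CREATE DATABASE IF NOT EXISTS `{db_name}`;"] + grants}
    for replica in REPLICAS:
        plan[replica] = list(grants)
    return plan


if __name__ == "__main__":
    # Hypothetical names, printed for review rather than executed.
    for host, statements in plan_statements("airflow_example", "airflow_example").items():
        print(f"-- {host}")
        for sql in statements:
            print(sql)
```

The sketch only prints statements; running them by hand on each host matches the manually managed setup discussed in this log.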
[13:24:58] I am around if you have more questions, but it is really easy
[13:24:58] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (Ottomata) Approved! no ssh or kerberos needed.
[13:25:29] I left a note in the gerrit patch about checking mariadb max conns and innodb buffer before adding more dbs on meta
[13:25:57] creating an airflow dedicated instance may be good as well to separate concerns
[13:26:50] Analytics, LDAP-Access-Requests, SRE, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (Ottomata) Hi! Approved! It'd be nice if the ticket explained the reason for the access request, not just the services wanted. @Maryana or @Kgordon c...
[13:26:51] it happened once IIRC that an-airflow1001 created a ton of db conns for some reason that ended up affecting the rest, I luckily bumped the max conns in time
[13:27:57] we should have 250 atm, and airflow 2.x should create less db conns, but having a lot of dbs with different workloads in the same place may become a problem sooner or later
[13:29:29] oh elukey i know the question i wanted to ask
[13:29:33] 2
[13:29:44] 1. is replication setup manual?
[13:30:22] i.e. there is nothing in puppet necessarily configuring replication, right? puppet just sets up the dbs and then replication is started manually?
[13:31:04] yes correct
[13:31:07] 2. i can't figure out how multi instance mariadb works. I see mariadb::instance uses systemd @ templating, but there don't seem to be any service resource declared for the instances in puppet. Are they just started manually and not managed by puppet?
[13:32:18] they are started manually (it is a design choice from data persistence)
[13:32:24] weird
[13:32:43] I think it makes sense, it gives a lot of control to the admin that stops/reboots/etc..
[13:32:55] I like the idea of dedicated mysql instance(s) for airflow, would be nice if we could wait for the new hardware.
[13:32:55] hm
[13:33:36] i think we should do a big refactor of an-coord mysql when we get the new dedicated db hardware for it in q1
[13:33:54] hm, but if we do it for a separate airflow instance now, perhaps it will be easier to do then
[13:35:16] in theory it should be easy to stop the new instance when the new hw arrives, copy everything to the new host(s) and point the db1108 replica to the new host
[13:36:02] it would mean stopping all the airflow schedulers but it is not a huge deal (I mean tolerable maintenance)
[13:40:25] hm yeah, maybe simpler than tuning 2+ mysql instances now on an-coord1001?
[13:42:57] hmmm, actually...if we didn't have the mysql dbs on an-coords, we could and would probably run airflow there?
[13:43:06] could even think about HA....
[13:43:08] meh i dunno
[13:43:09] ok
[13:56:38] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (cmadeo) Thanks @ottomata + @colewhite! Do I still need approval from @lucyblackwell ?
[13:57:30] heya!
[13:59:32] Analytics-Radar, SRE, SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Bumeh-ctr) Successfully logged in to Superset!
[14:00:15] Analytics, Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (Ottomata)
[14:00:17] Analytics: Analytics coordinator failover improvements - https://phabricator.wikimedia.org/T280905 (Ottomata)
[14:09:58] Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (Ottomata)
[14:12:40] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (lucyblackwell) I Lucy Blackwell, Carolyn’s manager approve this!
[14:16:47] Analytics, LDAP-Access-Requests, SRE, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (Kgordon)
[14:17:07] Analytics, LDAP-Access-Requests, SRE, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (Kgordon) @Ottomata, done! Thank you very much!
[14:20:08] milimetric or joal, do you have 10-ish mins to brainbounce on deletion script again? 😬
[14:20:48] I think we're both watching children
[14:21:21] definitely after standup or maybe earlier depending on when our nanny gets in
[14:21:37] mforns: i can brainbounce if you like!
[14:24:23] milimetric: don't worry! ottomata yesss please :D
[14:25:40] bc?
[14:26:21] mforns: gimme 5 sorry
[14:26:28] ottomata: no problemo!
[14:47:32] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) No worries - I will ping again on Monday next week
[15:12:44] (PS1) Fdans: Change state to allow more than one project [analytics/wikistats2] - https://gerrit.wikimedia.org/r/697797 (https://phabricator.wikimedia.org/T283624)
[15:40:57] Analytics, Analytics-Kanban: Failures registered by drop_event on an-launcher1002 - https://phabricator.wikimedia.org/T283126 (Ottomata) So, bad output contains an error like: ` ERROR Unable to move file /tmp/analytics/hive.log.2021-06-01 to /tmp/analytics/hive.log.2021-06-01 ` Looking at that file, I...
[15:47:53] mforns: ....any reason drop-older-than can't search / execute in parallel?
[15:47:58] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (Htriedman) @JBennett tagging you to flag that you need to sign off on this Re: Contract expiration, it's set to expire at t...
[15:48:15] ottomata: I don't think so
[15:48:24] I mean, no reason
[15:48:41] hm...
[15:49:10] at least within a given dataset
[15:49:39] I mean, I think we can parallelize datasets, but not sure we can parallelize within each dataset
[15:50:00] oh..
[15:50:10] right
[15:50:13] just each dataset
[15:50:21] ottomata: but is it worth?
[15:50:31] maybe?
[15:50:42] i dunno maybe not maybe we can make airflow do that eventually
[15:52:07] ottomata: you mean that to reduce the time to query the hive metastore?
[15:52:09] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:56:25] mforns, just reduce the time it takes to run the script
[15:56:34] i'm running drop event manually now
[15:56:40] aha
[15:56:41] 10 minutes in so far
[15:56:50] doing 1 dataset at a time
[15:57:09] do you think Hive metastore will take the queries better if they are in parallel?
[15:57:14] all at the same time?
[15:57:22] i don't think it will matter as long as they are not hugely parallel
[15:57:29] we could do 10ish at a time and speed things up a lot
[15:57:33] aha
[15:59:44] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (JBennett) Approved from my end.
[16:04:02] fdans: standup?
[16:04:08] razzi: standup?
[18:01:33] mforns: FYI drop_event takes about 1.5hrs to run right now :)
[18:02:06] ottomata: regular run? just 1 day of data?
[18:02:43] regular run, i think half a day late
[18:02:48] or a little more
[18:02:57] so maybe something like 35 hours of data
[18:04:19] is this an issue?
[18:08:54] mforns: no, its fine
[18:09:24] i betcha a simple change would make it run in a few minutes instead of hours, buuuut probably not worth the possible bugs we might introduce
[18:09:31] esp if we switch to airflow one day
[18:09:34] ottomata: I think we'll revisit all this with airflow no?
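A minimal sketch of the idea floated above: keep each dataset's work sequential, but process roughly 10 datasets at a time. This is not the actual drop-older-than code. `drop_dataset` is a hypothetical stand-in for whatever the script does per dataset, and the worker count of 10 just mirrors the "10ish at a time" figure from the discussion.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def drop_dataset(dataset):
    """Hypothetical stand-in for the per-dataset search-and-drop work."""
    return f"dropped old partitions for {dataset}"


def drop_all(datasets, drop_fn=drop_dataset, max_workers=10):
    """Run drop_fn once per dataset, at most max_workers at a time.

    Returns (results, failures), both keyed by dataset, so one failing
    dataset does not hide the outcome of the others.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(drop_fn, d): d for d in datasets}
        for future in as_completed(futures):
            dataset = futures[future]
            try:
                results[dataset] = future.result()
            except Exception as exc:  # record the error and keep going
                failures[dataset] = exc
    return results, failures


if __name__ == "__main__":
    # Hypothetical dataset names, for illustration only.
    done, failed = drop_all(["example_table_a", "example_table_b"])
    print(done, failed)
```

Threads are assumed to be enough here because the work would mostly be waiting on the metastore and filesystem; capping `max_workers` keeps the load from becoming "hugely parallel".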
[18:09:37] yea
[18:09:40] !log remove .deb packages from stat boxes: python3-mysqldb python3-boto python3-ua-parser python3-netaddr python3-pymysql python3-protobuf python3-unidecode python3-oauth2client python3-oauthlib python3-requests-oauthlib python3-ua-parser - T275786
[18:09:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:09:45] T275786: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786
[18:20:48] Analytics: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (Ottomata)
[19:30:48] Analytics: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (Ottomata) Steps to create a local conda airflow development environment: 1. Install miniconda. We're using python 3.7 on the cluster, so let's use that for our dev env too. Download one of the py37 installers...
[19:54:38] * razzi afk for lunch
[20:53:20] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (nshahquinn-wmf)
[20:53:25] Analytics, Event-Platform, Inuka-Team: InukaPageView Event Platform Migration - https://phabricator.wikimedia.org/T267344 (nshahquinn-wmf) Open→Resolved I confirmed that the stream was migrated successfully in T283768.
[20:53:37] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (nshahquinn-wmf)
[20:53:46] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFirstRun Event Platform Migration - https://phabricator.wikimedia.org/T267346 (nshahquinn-wmf) Open→Resolved I confirmed that the stream was migrated successfully in T283768.
[20:53:55] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (nshahquinn-wmf)
[20:54:05] Analytics, Event-Platform, Inuka-Team: KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (nshahquinn-wmf) Open→Resolved I confirmed that the stream was migrated successfully in T283768.
[22:07:15] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (colewhite)
[22:10:45] Analytics, Analytics-Kanban, Product-Analytics, SRE, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (colewhite)
[22:13:38] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to SUPERSET for CMADEO - https://phabricator.wikimedia.org/T284109 (colewhite) Open→Resolved The group membership change has been deployed. Please feel free to reopen if you encounter any related issue.
[22:17:53] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite)
[22:18:06] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite)
[22:23:02] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (colewhite) Open→Resolved a:Htriedman→colewhite The group membership change has been deployed. Please feel f...
[22:40:37] Analytics, LDAP-Access-Requests, SRE, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (colewhite)
[22:42:20] Analytics, LDAP-Access-Requests, SRE, Patch-For-Review: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (colewhite) Open→Resolved a:colewhite The group membership change has been deployed. Please feel free to reopen if you encounter any relate...
[22:42:29] Analytics-Radar, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (leila) I'll call it resolved and if someone disagrees they can reopen it.
[22:42:33] Analytics-Radar, Article-Recommendation: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (leila) Open→Resolved
[23:32:22] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (JAnstee_WMF) #operations. Having trouble gaining access despite approved production access. Verbose output: Last login: Wed Jun 2 16:14:12 on ttys000 janstee...