[00:32:09] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:36:34] (CR) Jenniferwang: "Please see my answers in lines." [analytics/refinery] - https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[02:42:27] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) Lots of work been happening. Talked with @JAllemandou about some keytab stuff we still have to figure out: I can make Skein + spark-submit + keytabs work if I upload the key...
[03:00:40] Analytics, Analytics-Wikistats: Add "Interwicket" to the list of bots - https://phabricator.wikimedia.org/T154090 (nshahquinn-wmf) Open→Declined Wikistats 1 has been shut down, and Wikistats 2 has no list of unflagged bots that this could be added to.
[05:04:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[05:09:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:25:40] Data-Engineering, Data-Catalog: Connect remaining Data Sources to the MVP [Mile Stone 5] - https://phabricator.wikimedia.org/T299899 (odimitrijevic) p:Triage→Medium
[06:28:54] Data-Engineering, Product-Analytics: Investigate easier methods for WMF staff to access Superset - https://phabricator.wikimedia.org/T258962 (odimitrijevic)
[08:07:26] (CR) Awight: Bug: T299007 Add the mediawiki_reading_depth event platform stream to the allowlist (10 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
[11:29:22] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (EChetty)
[11:29:29] !log btullis@puppetmaster1001:~$ sudo -i confctl select name=aqs1011.eqiad.wmnet set/pooled=yes
[11:29:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:06] !log pooled aqs1011 T298516
[11:30:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:08] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[11:39:50] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata _metrics - https://phabricator.wikimedia.org/T300021 (EChetty)
[11:42:31] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata _metrics - https://phabricator.wikimedia.org/T300021 (EChetty)
[11:44:51] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (EChetty)
[11:49:52] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) aqs1011 has been pooled this morning [[https://sal.toolforge.org/log/Jzz_kH4B1jz_IcWuey4j|at 11:29]]...
[11:53:50] Data-Engineering, Airflow: Low Risk Oozie Migration: interlanguage - https://phabricator.wikimedia.org/T300025 (EChetty)
[11:54:26] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (EChetty)
[11:55:06] Data-Engineering, Airflow: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T300027 (EChetty)
[11:55:32] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (EChetty)
[11:55:52] Data-Engineering, Airflow: Low Risk Oozie Migration: session length - https://phabricator.wikimedia.org/T300029 (EChetty)
[12:00:41] Data-Engineering, Airflow: Low Risk Oozie Migration: session length - https://phabricator.wikimedia.org/T300029 (EChetty)
[12:00:44] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (EChetty)
[12:01:35] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (EChetty)
[12:01:37] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (EChetty)
[12:01:39] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (EChetty)
[12:01:41] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata _metrics - https://phabricator.wikimedia.org/T300021 (EChetty)
[12:01:43] Data-Engineering, Airflow: Low Risk Oozie Migration: interlanguage - https://phabricator.wikimedia.org/T300025 (EChetty)
[12:01:45] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (EChetty)
[12:14:13] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (EChetty)
[12:20:46] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Migrate AQS hourly job - https://phabricator.wikimedia.org/T299398 (EChetty)
[12:20:48] Data-Engineering, Airflow: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T300027 (EChetty)
[12:31:19] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I have now realised that this build process is considerably more convoluted than I had anticipated, but it is progressing. I've switched my development host from...
[13:00:43] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) == Karapace Setup == === Build egg === ` cd src/datahub git clone https://github.com/aiven/karapace.git && cd karapace git checkout 2.0.1 python3 setup.py bdist_...
[13:05:37] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (JAllemandou)
[13:06:51] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (JAllemandou)
[13:07:29] Data-Engineering, Airflow: Low Risk Oozie Migration: session length - https://phabricator.wikimedia.org/T300029 (JAllemandou)
[13:08:35] Data-Engineering, Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (JAllemandou)
[13:16:21] Data-Engineering, Airflow: Low Risk Oozie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (JAllemandou)
[13:19:00] Data-Engineering, Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (JAllemandou)
[13:34:36] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic: Define and implement archiving for Airflow - https://phabricator.wikimedia.org/T300039 (JAllemandou)
[13:39:25] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic: Define and implement archiving for Airflow - https://phabricator.wikimedia.org/T300039 (JAllemandou)
[14:01:28] Data-Engineering, Airflow: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T300027 (JAllemandou)
[14:01:30] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T299398 (JAllemandou)
[14:01:43] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T299398 (JAllemandou)
[14:07:01] btullis: joal razzi so yeah, it looks like the hive_metastore db is messed up
[14:07:05] who knows what else.
[14:07:14] on test cluster
[14:07:28] so, i think we should wipe it. maybe other things are fine.
[14:08:43] oh yeah hm, mysql is even flapping there now
[14:08:56] hm yeah druid db is messed up too
[14:09:06] ottomata: Yep. I agree. Have we ever had to do this before? Any guidelines with which you're familiar?
[14:09:16] for druid, i'm not sure
[14:09:37] for hive it should be relatively simple. drop the database, recreate, and then i'm sure there's an init step
[14:09:40] looking for instructions
[14:09:59] https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-installation/content/validate_installation.html
[14:11:01] making an etherpad
[14:11:19] https://etherpad.wikimedia.org/p/an-test-coord1001-srv-recovery
[14:12:15] I'd feel better if we could just start with a fully clean mysql
[14:12:17] lets see
[14:12:30] so we need to account for
[14:12:42] druid, hive_metastore, hue_test, oozie
[14:12:49] lets see if we can mysql dump each of those first
[14:12:50] Yeah, I think I agree. But maybe we could dump the `mysql` tables so that we don't need to set up the grants manually?
[14:12:52] some of those might be fine
[14:12:55] ! :)
[14:12:57] oh grants
[14:13:01] hmm yeah lets try it all
[14:13:29] individually though for each db
[14:13:37] will put together some stuff in the etherpad
[14:13:42] Do you want to jump on a call? Or are you happier like this?
[14:13:53] lets jump on shortly, before we start doing things
[14:14:00] just going to try to get together a little plan first
[14:15:22] btullis: we don't have a replica, right? that's an-test-coord1002 which is still in hardware limbo?
[14:15:49] That's correct. It's in `insetup` now, but not a replica yet, unfortunately.
[14:22:22] haha i was looking for that command just now thank you ben!
[14:24:45] btullis: for grants, do we want to just restore the grants table? or the whole mysql db?
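Editor's sketch of the "dump each db individually" idea from the 14:12 to 14:13 messages above. This is not the procedure recorded in the etherpad. It only builds `mysqldump` command lines and does not run them. The output directory and the helper name `build_dump_commands` are hypothetical, and it assumes credentials come from a default option file.

```python
from pathlib import Path

# Database names taken from the 14:12:42 message.
DATABASES = ["druid", "hive_metastore", "hue_test", "oozie"]


def build_dump_commands(databases, out_dir):
    """Return one mysqldump argv list per database (nothing is executed)."""
    out = Path(out_dir)
    commands = []
    for db in databases:
        commands.append([
            "mysqldump",
            "--single-transaction",  # consistent snapshot for InnoDB tables
            "--databases", db,       # also emits CREATE DATABASE / USE for this db
            f"--result-file={out / (db + '.sql')}",
        ])
    return commands


if __name__ == "__main__":
    # /tmp/dumps is a placeholder path, not one mentioned in the log.
    for argv in build_dump_commands(DATABASES, "/tmp/dumps"):
        print(" ".join(argv))
```

Keeping one file per database means a dump that fails or turns out to be corrupt (as suspected for hive_metastore and druid) does not block restoring the others.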
[14:24:59] i guess the mysql db will not be correct, since it has buncha metadata about tables
[14:26:03] btullis: i'm going to go ahead and stop these services, they are causing mysql to flap when they do things
[14:26:17] we should stop druid too..
[14:26:43] Yeah, fine by me. I'm looking into `pt-show-grants` to back up and restore the grants to a fresh mysql db.
[14:27:27] ok
[14:29:25] !log stopping druid* on an-test-druid1001 - T299930
[14:29:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:29:28] T299930: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930
[14:29:33] (after downtiming an-test-druid1001)
[14:30:22] !log stopping services on an-test-coord1001 - T299930
[14:30:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:30:52] oh hue isn't on an-test-coord...?
[14:31:07] an-test-ui1001.eqiad.
[14:33:04] oh presto too
[14:34:04] What was the hue_test database for?
[14:34:06] heya ottomata :] what part of airflow you want to demo today?
[14:34:26] OH crapo we have planning grr, maybe we can get an-test-coord fixed in 26 mins!
[14:34:39] mforns: i'm going to wing the artifact, conda and skein spark stuff
[14:34:39] ?
[14:34:44] anything else you want to do?
[14:34:45] or
[14:34:52] you can do any of that if you like
[14:34:58] Yep, 26 minutes. No probs. :-)
[14:34:59] ok btullis lets jump in a call
[14:35:06] bc
[14:35:23] no no, ottomata, please, showcase the artifact, conda and skein/spark stuff
[14:35:56] ottomata: is it OK if I give a more generic summary and demo first, and then hand it over to you, since your part is more technical?
[14:40:29] okay
[15:31:47] Data-Engineering, Airflow: Add data-quality to airflow DAGs' name - https://phabricator.wikimedia.org/T300054 (JAllemandou)
[15:32:09] mforns: I just created that --^
[16:06:46] Data-Engineering, Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (ops-monitoring-bot) Icinga downtime set by razzi@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: Still troubleshooting mariadb issues ` an-test-coo...
[16:09:52] Data-Engineering, Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (razzi) Resolved→Open After I messed up the /srv volume, we attempted a restore from a dump but there are issues. Notes for the recovery are in https://etherpad.wi...
[16:11:32] Quarry, cloud-services-team (Kanban): Do some checks of how many Quarry queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (nskaggs)
[16:12:46] Data-Engineering, Data-Services, Documentation, cloud-services-team (Kanban): Document on wikitech the general process of getting a table/column exposed to Wiki Replica users - https://phabricator.wikimedia.org/T209992 (nskaggs) a:Bstorm→None
[16:13:37] Quarry, cloud-services-team (Kanban): Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (nskaggs) a:Bstorm→None
[16:18:48] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Ottomata)
[16:38:21] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Ottomata)
[16:42:22] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (Ottomata)
[16:56:42] (PS1) MNeisler: Add 2010 wikitext editor option to edit_hourly interface field [analytics/refinery] - https://gerrit.wikimedia.org/r/757035 (https://phabricator.wikimedia.org/T293406)
[16:58:05] Data-Engineering, MediaWiki-Page-editing, Editing-team (Tracking), Patch-For-Review, and 2 others: Update edits_hourly to ingest new legacy wikitext editor change tag - https://phabricator.wikimedia.org/T293406 (MNeisler)
[19:19:49] razzi: i'm just going to half pay attention to this talk, want to continue work on an-test-coord together?
[19:20:13] btullis: you said you dumped grants, i assume you did not load any of them?
[19:23:30] Correct. I added the `--drop` argument, but it doesn't do `DROP USER IF EXISTS`, which seems a bit silly.
[19:23:50] ok, i think that's ok
[19:24:19] btullis: ok if i proceed? (I think i've heard this talk before, so i can skip it)
[19:24:27] i can jump in a call if you prefer (i know it is late for you)
[19:24:57] Sorry I can't be around right now. In a Welsh lesson on Zoom. Not really paying much attention. Please do proceed.
[19:25:01] yeah no prob
[19:25:10] ok proceeding, razzi if you are around and want to do it together let me know
[19:26:27] ah yeah i see, adding if user exists to file
[19:33:36] ok great, hive and oozie looking good
[19:33:45] going to restart refine and gobblin jobs
[19:33:53] hmm no
[19:33:56] lets get druid fixed first
[19:37:22] !log resetting test cluster druid via druid reset-cluster https://druid.apache.org/docs/latest/operations/reset-cluster.html - T299930
[19:37:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:37:25] T299930: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930
[19:43:57] ^ didn't work, it doesn't run?
[19:44:12] i think all i really need is to wipe the data dirs on an-test-druid1001, since the meta db is already gone
[19:46:45] !log removing hdfs druid deep storage from test cluster
[19:46:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:28:56] (PS26) AGueyte: WIP: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[20:30:20] (CR) jerkins-bot: [V: -1] WIP: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[20:30:58] (CR) AGueyte: WIP: Basic ipinfo instrument setup (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[20:31:33] Data-Engineering, Data-Engineering-Kanban: Consider resizing an-test-coord1001 partitions - https://phabricator.wikimedia.org/T299930 (Ottomata) OKAY! I believe we are back in business on an-test-coord1001. Hive and Oozie databases were able to be restored, which is great! We didn't have to manually s...
[20:32:26] (PS27) AGueyte: WIP: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[20:34:30] joal: i think you are gone for the day, but if you aren't lets talk skein kerberos again?
[21:19:19] (PS1) GoranSMilovanovic: T294983 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/757104
[21:19:43] (CR) GoranSMilovanovic: [V: +2 C: +2] T294983 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/757104 (owner: GoranSMilovanovic)
[21:22:25] Thanks for doing the reset ottomata !!
[21:23:17] How's it going with druid and gobblin and refine etc?
[21:24:29] btullis: haven't been looking closely but i think so far so good
[21:25:53] Cool. Let me know if there's anything left over that you'd like me to look at when I get here in the morning. Or ping me if I can help in the meantime.
[21:32:53] (CR) Jenniferwang: "Hi Ottomata, Joal," [analytics/refinery] - https://gerrit.wikimedia.org/r/753178 (https://phabricator.wikimedia.org/T299007) (owner: Jenniferwang)
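Editor's sketch for the 19:23 to 19:26 exchange about the grants dump, where the `--drop` output lacked `DROP USER IF EXISTS` and the file was edited by hand. One way to script that edit is shown below. It assumes the file contains statements that begin with `DROP USER` at the start of a line and that the target server accepts `DROP USER IF EXISTS`. The helper name `add_if_exists` is hypothetical, and the log does not say how the file was actually changed.

```python
import re

# Matches a DROP USER at the start of a line that is not already conditional.
_DROP_USER = re.compile(
    r"^([ \t]*DROP USER)\s+(?!IF EXISTS\b)",
    re.IGNORECASE | re.MULTILINE,
)


def add_if_exists(sql):
    """Rewrite `DROP USER x` as `DROP USER IF EXISTS x`; other lines are untouched."""
    return _DROP_USER.sub(r"\1 IF EXISTS ", sql)
```

The negative lookahead makes the rewrite safe to run twice on the same file, so a partially edited dump is not mangled.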