[00:08:32] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:41:30] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui)
[05:41:49] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui) Patch uploaded, this is only waiting for the ssh key verification.
[05:47:33] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey)
[06:01:48] Analytics-Radar, SRE, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Thanks to Keith's work we now have two 5-nodes clusters! \o/ The last step before...
[06:19:09] Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (elukey) One thing that it is not clear to me is why the issue presents itself only with `saveNamespace`, meanwhile when the standby Namenode executes its timer to `fetchImage` from 1001 everything works...
[07:01:33] !log roll restart hdfs namenodes to pick up new GC/heap settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/695933
[07:01:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:03:23] will start with the NN on 1002
[07:25:44] 1002 seems good, I'll failover 1001 -> 1002 in a bit
[07:27:23] done, let's see how 1002 performs.. I'll let it run for ~20/30 mins just in case, then I'll proceed with 1001's restart + failback
[07:27:27] no rush :)
[07:33:59] so far nothing weird came up
[07:45:52] restarted 1001
[08:04:59] failed back to 1001, all done
[08:31:19] Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (elukey) Very interesting graphs for the timeframes of the saveNamespace issue: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=56&orgId=1&from=1621957037469&to=1621962440236 There is a clear...
[13:47:07] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Ottomata) Approved. FYI to SRE: this is addition to the analytics-privatedata-users posix group without...
[13:48:13] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Ottomata) @schoenbaechler can you get Lucy Blackwell to approve this access here on this ticket?
[13:48:20] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Marostegui) @Ottomata this ticket is assigned to you, will you take care of it?
[13:49:09] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:50:39] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Ottomata) a:Ottomata→None Ah oops, it was assigned to me because the process was not clear, fixed...
[13:53:31] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) @Marostegui mysql Q for ya: https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#s...
[13:55:22] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Marostegui) p:Triage→Medium a:Marostegui @schoenbaechler can you please confirm you've read an...
[13:55:26] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Marostegui)
[14:00:02] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Marostegui) That variable is deprecated in MySQL, so you probably don't want to use it.
[14:00:42] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Marostegui) >>! In T283190#7119409, @Ottomata wrote: > Ah oops, it was assigned to me because the process...
[14:01:17] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Marostegui)
[14:04:31] Analytics, Analytics-Kanban, Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (mforns) @CorinnaHillebrand_WMDE, @AbbanWMDE & @gabriel-wmde ping?
[14:17:53] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui) a:Marostegui
[14:25:33] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) From reading https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_explicit_defaults_for...
[14:30:36] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Marostegui) Yeah, my point is that if you use it now, if you get to upgrade mysql (depending on how hard they remove it)...
[14:34:52] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Luca is the best for docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta#Backup So...
[14:35:16] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) > you'll need to change the config to remove it. Yeah makes sense. I could add a big ol comment around it abou...
[14:36:09] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Marostegui) db1108 is owned by Analytics yeah. So if you change it on the master, I would recommend changing it everywhe...
[14:37:28] !log removed Luca's and Tobias' emails from analytics-alerts@
[14:37:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:07] Analytics, Analytics-Kanban, Patch-For-Review: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 (mforns)
[15:05:09] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:05:18] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:05:21] Analytics, Analytics-Kanban, SRE, Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (mforns)
[15:05:28] Analytics-Radar, User-Elukey: Restoring the daily traffic anomaly reports - https://phabricator.wikimedia.org/T215379 (mforns)
[15:05:31] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:05:41] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:05:43] Analytics, Analytics-Kanban, User-Elukey: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (mforns)
[15:05:52] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:05:54] Analytics, Analytics-Kanban, Research, Patch-For-Review: Add data quality metric: traffic variations per country - https://phabricator.wikimedia.org/T234484 (mforns)
[15:06:05] Analytics, Analytics-Kanban: Traffic anomaly alarms - https://phabricator.wikimedia.org/T267355 (mforns)
[15:06:10] Analytics-Radar, Privacy Engineering, SRE, Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (mforns)
[15:12:20] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Bumeh-ctr) >>! In T283648#7117478, @Marostegui wrote: > @Bumeh-ctr can you post your ssh key on wikitech with your bumeh-ctr account on your, user pa...
[15:14:08] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui) @Bumeh-ctr can you edit your wikitech page: https://wikitech.wikimedia.org/wiki/User:Bumeh-ctr (logged in with your Bumeh-ctr account) an...
[15:41:47] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Bumeh-ctr) @Marostegui It's done now. I hope I did it as you expected.
[15:43:00] Analytics, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui) That works - thanks!
[15:49:53] (PS1) Neil P. Quinn-WMF: Exclude .DS_Store files from repo [schemas/event/secondary] - https://gerrit.wikimedia.org/r/696559
[16:01:02] Analytics-Radar, SRE, ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson) @elukey an-worker1129 has been moved to A2
[16:06:14] razzi, ottomata o/ --^ I think that we can review the task and add some workers
[16:06:36] there is one move left to do but we can start anyway
[16:33:36] Analytics, Analytics-Kanban: Requesting Kerberos password for bumeh-ctr - https://phabricator.wikimedia.org/T283710 (odimitrijevic) a:Ottomata
[16:37:49] Analytics, Analytics-EventLogging, Analytics-Kanban: Replace Content::getNativeData() calls with TextContent::getText() in EventLogging - https://phabricator.wikimedia.org/T283671 (odimitrijevic) a:Ottomata
[16:40:14] Analytics: Add ignore success flags option to pageview monthly dumps - https://phabricator.wikimedia.org/T283593 (odimitrijevic) p:Triage→High a:fdans
[16:41:06] Analytics: Add ignore success flags option to pageview monthly dumps - https://phabricator.wikimedia.org/T283593 (odimitrijevic)
[16:41:08] Analytics, Analytics-Kanban: Create monthly job for canonical pageviews - https://phabricator.wikimedia.org/T265732 (odimitrijevic)
[16:42:22] Analytics, Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T283562 (odimitrijevic) Open→Invalid No description
[16:43:41] Analytics, Analytics-Kanban, Product-Analytics: Request to delete test_gsc_* datasets from Druid (& Superset/Turnilo) - https://phabricator.wikimedia.org/T283536 (odimitrijevic) a:JAllemandou
[16:55:47] (PS1) Neil P. Quinn-WMF: Allow anonymous users in ContentTranslation schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/696584 (https://phabricator.wikimedia.org/T278942)
[16:56:54] Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (odimitrijevic) Verify if the memory changes fixed the issue during cluster upgrade. If saveNamespace fails use downtime to fix issue and if works proceed with upgrade.
[16:57:56] Analytics-Radar, SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (odimitrijevic)
[17:03:05] Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (elukey) I think that more research is needed about why the service handler queue (60 available threads total) wasn't able to process zkfc health checks. It may be something related to the number of thre...
[18:42:27] (PS2) Mforns: Add safety limits to refinery-drop-older-than [analytics/refinery] - https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433)
[20:14:45] Analytics, Product-Analytics, Product-Data-Infrastructure, Language-Team (Language-2021-April-June): All events in the contenttranslationabusefilter data stream failing validation - https://phabricator.wikimedia.org/T283872 (nshahquinn-wmf) Tagging #analytics and #product-data-infrastructure for...
[20:31:30] Analytics, Analytics-Kanban, Packaging, Patch-For-Review: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) @Volans42 I've manually reinstalled our dev .deb on an-test-coord1001. What makes it needing to be reinstalled? I can remove it again if ne...
[20:36:06] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) Ok, so this didn't go as planned, but there were no lasting issues or data loss. The full logs of the day [are here](https://wm-bot.wmflabs.org/liber...
[20:38:39] Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (Ottomata) Discussed in grooming today. Plan: - Schedule another downtime with the main intention of solving this saveNamespace problem. -- If Luca's recent patches work and saveNamespace just works, t...
[21:46:21] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (nshahquinn-wmf)
[22:00:12] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (nshahquinn-wmf)
[22:00:44] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (nshahquinn-wmf) Sorry, Yuvi—I removed your username from the description,...
[22:00:57] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (nshahquinn-wmf)
[22:01:52] Analytics, Analytics-Kanban, Packaging, Patch-For-Review: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Volans) @Ottomata nothing is needed AFAICT, APT is happy again, thanks.