[04:29:08] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:40:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:41:30] elukey: Many thanks for that. I will look into the space on an-airflow1001 now.
[08:53:16] !log rebooting an-master1002 - T304938
[08:53:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:07:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:08:49] --^ I have acked the alert - I'm going to do a hive failover today anyway as part of the scheduled reboots - so this will go away.
[09:09:38] First I'm going to do a manual failover of HDFS and YARN to an-master1002, so that I can reboot an-master1001.
[09:10:57] (Actually, YARN doesn't need a manual failover.)
[09:13:20] !log switching HDFS services to an-master1002
[09:13:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:30:02] !log 2nd attempt to switch HDFS services to an-master1002
[09:30:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:34:21] We have a missing webrequest partition at the moment. I'm not sure how to handle this. Do I re-run a gobblin job to pull the missing data? Not sure who's around to advise.
[09:36:08] https://usercontent.irccloud-cdn.com/file/KuhI7QqP/image.png
[09:38:30] Hi btullis
[09:38:52] Hiya. I was hoping you were around :-)
[09:39:33] !log failover to an-master1002 successful at 3rd attempt
[09:39:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:42:35] btullis: the issue seems related to some problem with real traffic (large objects being fetched)
[09:42:53] btullis: we can rerun the hour with a higher error threshold
[09:44:47] joal: OK, that sounds good. Is this an oozie job that executes gobblin to do the fetching?
[09:45:01] btullis: this is unrelated to gobblin :)
[09:45:16] the error is from webrequest-refine
[09:45:21] that is an oozie job
[09:45:49] But we can't just rerun that oozie job, as the thresholds need to be overridden
[09:46:05] We need to follow the procedure described here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dealing_with_data_loss_alarms
[09:47:37] I can do it if you wish btullis
[09:49:02] Oh, right. Sorry. How am I still confused by this? Where does gobblin write to, then? Not partitions?
[09:49:17] btullis: wanna chat in batcave?
[09:49:41] Please, if you have the time.
[10:10:15] I have submitted the following job to try to create the missing partition.
[10:10:25] https://www.irccloud.com/pastebin/Z6Ig894R/
[10:11:23] Where bundle.properties has `oozie.bundle.application.path` replaced with `oozie.coord.application.path = ${coordinator_file}`
[10:11:24] btullis: I assume you have modified the bundle.properties file?
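For context, a resubmission like the one in the pastebin above typically uses the oozie CLI with the modified properties file and the data-loss thresholds overridden. A minimal sketch, assuming a hypothetical oozie server URL, hour window, and threshold property name (the authoritative steps are on the Dealing_with_data_loss_alarms page linked above):

  # Sketch only: resubmit the webrequest-load coordinator for the missing hour,
  # relaxing the data-loss threshold so the job can succeed despite the lost rows.
  # The server URL, times, and property name below are illustrative assumptions.
  oozie job -oozie http://an-coord1001.eqiad.wmnet:11000/oozie \
    -config bundle.properties \
    -Dstart_time=2022-04-25T09:00Z \
    -Dstop_time=2022-04-25T10:00Z \
    -Derror_data_loss_threshold=100 \
    -run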
[10:11:27] yeah :)
[10:11:56] Maybe I should have stopped and double-checked with you :-)
[10:12:06] no worries - all good :)
[10:25:06] !log restarting the `check_webrequest_partitions` service on an-launcher1002
[10:25:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:55] RECOVERY - Check unit status of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:27:06] success btullis :) --^
[10:28:53] Great. Now I re-run the three other oozie jobs, right? webrequest-druid-daily-coord, webrequest-druid-hourly-coord, webrequest-load-coord-upload?
[10:32:25] hm, I don't think so btullis - they just sent us SLAs, no failure
[10:33:44] actually the webrequest-druid-hourly-coord already caught up, and the daily one is running
[10:34:38] Oh right, so they would wait for the partitions to be available, then begin work, but send us an SLA_MISS if they're waiting for a while?
[10:35:05] absolutely - the SLA email is sent after the job has been waiting for a configured time
[10:40:58] OK, got it. I'm a bit confused by the `Expected Duration (in mins) - -1` and `Actual Duration (in mins) - -1` bits in the emails from oozie.
[11:00:13] I'm about to go for a reboot of an-master1001 - HDFS services are already transferred to an-master1002 - YARN and NameNode services should just fail over automatically.
[11:01:11] !log rebooting an-master1001
[11:01:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:32:31] Analytics-Clusters, Data-Engineering, Product-Analytics, Superset, Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (jbond) >>! In T275575#7823142, @razzi wrote: > @jbond I see you've worked on our identity provider, is...
[11:36:02] OK, here's a strange problem. After rebooting an-master1001, its YARN prometheus metrics seem to be incomplete. Compare these.
[11:36:07] https://www.irccloud.com/pastebin/eKs8DYcm/
[11:36:18] https://www.irccloud.com/pastebin/ZgrU4UPV/
[11:38:15] Investigating this now, but in the meantime an-master1002 is still the active node for HDFS.
[11:47:02] Oh, I think it's OK. The metrics are only available on the active resourcemanager.
[11:49:42] !log restarted hadoop-yarn-resourcemanager on an-master1002 to force the active role back to an-master1001
[11:49:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:51:02] !log failing back hdfs active role to an-master1001
[11:51:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:35:39] q
[12:35:45] oops :)
[13:30:20] o/
[13:31:04] Data-Engineering: Add projects to sqoop list when synced in clouddb - https://phabricator.wikimedia.org/T304632 (Snwachukwu) guwwiki cloud db is not yet synchronized. ` ebysans@stat1004:~$ mysql --database guwwiki_p -h clouddb1021.eqiad.wmnet -P 3315 -u s53272 -p Enter password: Welcome to the MariaDB moni...
[14:05:26] Data-Engineering, SRE, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (BTullis) cc @Milimetric and @Ottomata who probably know the most about the current behaviour regarding `wprov` and mobile vs desktop view recording.
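As an aside, a quick way to verify which node holds the active roles after a failback like the one logged at 11:49-11:51 is to query the HA state directly. A minimal sketch, assuming the HA service IDs are named after the hosts (the real IDs are defined in hdfs-site.xml and yarn-site.xml):

  # Sketch: confirm active/standby state for the HDFS NameNode and YARN ResourceManager.
  # The service IDs below are assumptions; list the real ones with
  # `hdfs getconf -confKey dfs.ha.namenodes.<nameservice>`.
  sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet   # expect: active
  sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet   # expect: standby
  sudo -u yarn yarn rmadmin -getServiceState an-master1001-eqiad-wmnet   # expect: active

This also accounts for the prometheus observation above: the YARN metrics in question are only exported by whichever resourcemanager is currently active.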
[14:12:29] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Stop ingesting data to the old AQS cluster - https://phabricator.wikimedia.org/T302276 (BTullis) A decision has been taken to migrate the cassandra3 loading to Airflow, rather than modify the current oozie job...
[14:13:22] Data-Engineering, Cassandra, Platform Team Workboards (Platform Engineering Reliability): Final cleanup tasks related to the AQS cluster migration - https://phabricator.wikimedia.org/T302278 (BTullis) Removing this from Kanban until the dependent tasks have been resolved.
[14:57:56] (PS3) Snwachukwu: Create Hql script to generate API (rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028)
[15:03:54] ottomata, mforns, milimetric - standup?
[15:20:19] Analytics, Analytics-Jupyter, Data-Engineering: Autocomplete is very slow (unusable) in Newpyter - https://phabricator.wikimedia.org/T290008 (JArguello-WMF)
[15:23:18] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Oozie Migration: Mediawiki History Dumps - https://phabricator.wikimedia.org/T300344 (JArguello-WMF)
[15:48:37] Data-Engineering, Data-Engineering-Kanban, Cassandra, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (JArguello-WMF)
[16:06:45] Data-Engineering, Data-Engineering-Kanban, DC-Ops, Infrastructure-Foundations, Patch-For-Review: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye - https://phabricator.wikimedia.org/T306148 (JArguello-WMF)
[16:11:56] Analytics-Wikistats, Data-Engineering, Browser-Support-Opera: Opera 15+ seems not to be recognized correctly - https://phabricator.wikimedia.org/T61816 (JArguello-WMF) p: Low→High This relates to Data Quality, that's why the priority changed.
[16:15:29] Analytics-Wikistats, Data-Engineering, Browser-Support-Opera: Opera 15+ seems not to be recognized correctly - https://phabricator.wikimedia.org/T61816 (Antoine_Quhen) a: Antoine_Quhen
[17:26:15] Analytics-Wikistats, Data-Engineering, Browser-Support-Opera: Opera 15+ seems not to be recognized correctly - https://phabricator.wikimedia.org/T61816 (JAllemandou) Can you please provide more information, such as the data you look at, and what you'd be expecting?
[17:32:17] Data-Engineering, Data-Engineering-Kanban, Cassandra, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (JAllemandou) I support keeping two users to separate loading from accessing, but it's not a strong opinion - let me know if it's too much of a bu...
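On the aqsloader question in T305600, the loading/reading separation JAllemandou mentions maps naturally onto two Cassandra roles with different grants. A rough sketch via cqlsh - the role names, keyspace, and password handling are illustrative guesses, not the actual AQS configuration:

  # Sketch: one role that can only write (the loader) and one that can only read.
  # Role names, keyspace name, and passwords below are assumptions.
  cqlsh -u cassandra -e "
    CREATE ROLE aqsloader WITH PASSWORD = 'CHANGEME' AND LOGIN = true;
    GRANT MODIFY ON KEYSPACE aqs TO aqsloader;
    CREATE ROLE aqsreader WITH PASSWORD = 'CHANGEME' AND LOGIN = true;
    GRANT SELECT ON KEYSPACE aqs TO aqsreader;
  "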
[17:44:59] Data-Engineering: Update ua-parser library for traffic data - https://phabricator.wikimedia.org/T306829 (JAllemandou)
[17:56:37] Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform, Patch-For-Review: Upgrade event platform related VMs in deployment-prep to Debian bullseye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata)
[18:09:30] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:33:40] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Pageview definition relies on X-Analytics to determine special pages - https://phabricator.wikimedia.org/T304362 (mforns) Today, at our grooming meeting, we were trying to evaluate whether this task was a "must do", a "should do", or a "cou...
[18:34:00] (PS13) Sharvaniharan: Android schemas migrated from legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/778603
[18:38:53] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:11:51] Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform: Upgrade event platform related VMs in deployment-prep to Debian bullseye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata)
[19:13:00] Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform: Upgrade event platform related VMs in deployment-prep to Debian bullseye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata) Update: all nodes have been replaced with either bullseye or buster! O...
[19:16:25] Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform: Upgrade event platform related VMs in deployment-prep to Debian bullseye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata) Ah! I just needed to add the correct firewall security group. It works!
[19:21:59] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4: (Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (RobH)
[19:22:20] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4: (Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (RobH)
[20:08:09] (CR) Ottomata: [C: +2] Modify ios_notification_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/784293 (https://phabricator.wikimedia.org/T290920) (owner: Tsevener)
[20:09:13] !log dropping event.ios_notification_interaction hive table and data for backwards-incompatible schema change in T290920
[20:09:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:09:16] T290920: Create schema to track metrics for user notifications on iOS - ios_notification_interaction - https://phabricator.wikimedia.org/T290920
[21:08:30] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Pageview definition relies on X-Analytics to determine special pages - https://phabricator.wikimedia.org/T304362 (Krinkle) >>! In T304362#7878093, @mforns wrote: > […] > I executed a quick query to proxy the proportion of pageviews affected...
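For reference, the table-and-data drop logged at 20:09 usually amounts to dropping the Hive table and removing its backing files, so the table can be recreated from the new, incompatible event schema version. A minimal sketch, assuming the conventional event data layout (the user and the exact HDFS path below are guesses):

  # Sketch: drop the Hive table, then remove the underlying HDFS data.
  # The table lives in the `event` database; the path is an assumed layout.
  sudo -u analytics hive -e 'DROP TABLE IF EXISTS event.ios_notification_interaction;'
  sudo -u analytics hdfs dfs -rm -r -skipTrash /wmf/data/event/ios_notification_interaction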