[00:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[01:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:44] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:28:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:26:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:28] !log set "loadByPeriod(P15D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460
[08:02:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:02:32] T337460: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460
[08:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:41:19] (CR) Gergő Tisza: [C: +1] Add section title and ordinal in the image_suggestion_interaction schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/924879 (https://phabricator.wikimedia.org/T335716) (owner: Sergio Gimeno)
[08:58:08] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (Antoine_Quhen) I've a MR with: - It adds linting to the wmf_airflow_common & analytics_test folders. - It proposes some Python code autoformatting...
[09:07:05] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:08:02] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) We have merged the first patch to decommission analytics1058. Verified that it has entered decommissioning mode as far as the HDFS namenode is concerned. {F37096550,width=6...
[09:15:23] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (SGupta-WMF) a:BPirkle→SGupta-WMF
[09:20:06] btullis, stevemunene o/ I am going to merge the first change for varnishkafka, to add the catch all unit
[09:21:04] it will cause a restart of all the units
[09:21:23] but it should be staggered
[09:21:32] I'll test on a ulsfo node first
[09:25:33] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:26:23] elukey: Great, please keep us in the loop.
[09:26:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:38] stevemunene: These icinga alerts for nodemanager on analytics1058 are expected. Since we excluded this host from the YARN resourcemanager configuration on the masters, the node no longer connects and then it fires an icinga alert. It might keep failing and being noisy, so we can add some downtime for the service.
[09:29:17] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:39] Ack elukey
[09:29:51] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:29:59] adding some downtime for the instance btullis
[09:30:19] stevemunene: Not the instance, just the affected services.
[09:30:56] https://usercontent.irccloud-cdn.com/file/e2QlXE2x/image.png
[09:32:21] Sorry, I didn't mean that to sound so contradictory :-) I meant to say that there is a cookbook to add downtime for the whole host, but in this case we just want to silence a couple of services.
[09:35:40] You can also create a silence from alertmanager. I see that you acked it here on alertmanager, but it's short term. https://usercontent.irccloud-cdn.com/file/EYQXEc1A/image.png
[09:39:24] Ahaa, thanks Ben
[09:52:03] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: Page analytics: per-article should return expected data for monthly granularity - https://phabricator.wikimedia.org/T337564 (Atieno) picking this up..
[09:55:51] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:58:13] Analytics-Radar, Patch-For-Review: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183 (TheDJ)
[09:58:52] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) We're now watching this `Under replicated blocks` value decreasing slowly, as the data is copied to other hosts. {F37096584,width=60%} Interestingly, the capacity value has...
[10:25:18] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (elukey) An alternative could be to group the nodes in the same rack and decom them in little batches: ` analytics1058.eqiad.wmnet: /eqiad/A/1 -> already started analytics1059.eq...
[10:51:35] Data-Engineering: BanMeAlreadyLMAO. I'm aka KEMONO_PANTSU. - https://phabricator.wikimedia.org/T338305 (BanMeAlreadyLMAO)
[10:55:43] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:26] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (BTullis) >>! In T317861#8908912, @elukey wrote: > An alternative could be to group the nodes in the same rack and decom them in little batches: > The idea is to avoid to decom nodes...
[11:03:51] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (BTullis) Hi @Ladsgroup - I'm planning on running a `maintain-views` on clouddb1021 at some point today as part of T315426. It's going to recreate the views on one table on all databases on all shards. Is this going...
[11:05:38] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) Hi, The optimization of s3 is done so you can go ahead. I'll pick it up again tomorrow. It's much smaller and can wait for a while
[11:20:07] ACKNOWLEDGEMENT - SSH on an-worker1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis T338310 Won't power back on. Have raised DC-ops hardware ticket. https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:20:07] ACKNOWLEDGEMENT - Host an-worker1125 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T338310 Won't power back on. Have raised DC-ops hardware ticket.
[11:29:40] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[11:46:50] (PS8) Nick Ifeajika: fix the metric query. Use single write call to destination table and a union all to join by_category and totals metrics [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[11:52:49] !log running `sudo maintain-views --all-databases --table abuse_filter_history --replace-all` on clouddb1021 for T315426
[11:52:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:00:47] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (BTullis) Thanks, I've done that change now, so you can proceed with the tidy-up whenever you like.
[12:11:47] (CR) Gergő Tisza: [C: +1] Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: Sergio Gimeno)
[12:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:06:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:58] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:43] !log running `maintain-views --all-databases --table abuse_filter_history --replace-all` on A:wikireplicas-analytics
[14:04:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:13:05] !log running `sudo cumin A:wikireplicas-web 'maintain-views --all-databases --table abuse_filter_history --replace-all'` on A:wikireplicas-web
[14:13:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:40:28] (PS4) Sergio Gimeno: Add section title and ordinal in the image_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/924879 (https://phabricator.wikimedia.org/T335716)
[14:41:39] (CR) Sergio Gimeno: Add section title and ordinal in the image_suggestion_interaction schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/924879 (https://phabricator.wikimedia.org/T335716) (owner: Sergio Gimeno)
[15:06:41] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (elukey) yes yes good point, it should be safe, but I'd be cautious on the batch size just to be sure (HDFS is battle tested but we had some horror stories in the past :D)
[15:23:12] !log all varnishkafka instances on caching nodes are getting restarted due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087 - T337825
[15:23:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:23:15] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[15:24:46] Data-Engineering, Data-Platform-SRE, SRE, Traffic, Patch-For-Review: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) The new `varnishkafka-all` unit is being rolled out across all cp nodes. Next steps: * Merge https://gerrit.wikimedia.org/r/924507 (no-op, just...
[15:38:38] !log installing presto 0.281 to the test cluster
[15:38:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:45:06] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene)
[15:46:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:55] (Abandoned) Snwachukwu: Resolve Guava toImmutableList method error [analytics/refinery/source] - https://gerrit.wikimedia.org/r/922894 (owner: Snwachukwu)
[16:09:00] Data-Engineering: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (AndyRussG_volunteer) Hi! Thanks so much for the work on this, and apologies for the delay. Here are some notes, in case they're useful: - There are a couple dashboards I created on Superset that might be wor...
[16:12:07] Data-Engineering: NEW BUG REPORT Need edit rights for Kimberly Sarabia - https://phabricator.wikimedia.org/T332460 (KSarabia-WMF)
[16:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:34:42] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), ci-test-error: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (Snwachukwu)
[16:39:02] Data-Engineering, Data Pipelines, Continuous-Integration-Config, Event-Platform Value Stream (Sprint 14 B), and 2 others: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (hashar) The java8 images are based on Debian Stretch and can no longer be rebuilt....
[16:49:06] Data-Engineering, Data Pipelines, Continuous-Integration-Config, Event-Platform Value Stream (Sprint 14 B), and 2 others: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (hashar) The Jenkins jobs should now point back to the last good image and the buil...
[16:52:32] Data-Engineering: Upgrade Presto to access UDF library improvements - https://phabricator.wikimedia.org/T295589 (BTullis) Hi @awight - Apologies for the delay in getting around to this. Just FYI we have upgraded the version of presto on the test cluster to 0.281. If all goes well with our testing, we should...
[17:18:07] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Izno)
[17:18:19] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Izno)
[17:20:19] Data-Engineering, Data Pipelines, Continuous-Integration-Config, Event-Platform Value Stream (Sprint 14 B), and 2 others: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (Snwachukwu) Thank you @hashar
[17:21:13] Data-Engineering, Data Pipelines, Continuous-Integration-Config, Event-Platform Value Stream (Sprint 14 B), and 2 others: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (Snwachukwu) Open→Resolved
[17:21:20] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (Snwachukwu)
[17:25:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:30:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:00:44] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (dduvall) I'm wondering if the hang is due to some interaction between the `--wait-for-ready` implementation we added to our buildkit fork and the `--output type=lo...
[18:22:40] Data-Engineering, Data Pipelines (Sprint 14): NEW BUG REPORT Need edit rights for Kimberly Sarabia - https://phabricator.wikimedia.org/T332460 (JArguello-WMF)
[18:41:10] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 7 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup)
[18:45:35] Data-Engineering, Data Pipelines (Sprint 14): NEW BUG REPORT Need edit rights for Kimberly Sarabia - https://phabricator.wikimedia.org/T332460 (lbowmaker) Open→Resolved a:lbowmaker Resolved. Kim's username wasn't showing up in DataHub. @BTullis resolved this using this method: https://phabr...
[18:46:49] RECOVERY - Host an-worker1125 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[18:57:56] Starting build #22 for job wikimedia-event-utilities-maven-release-docker
[18:58:31] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Media Analytics Service - https://phabricator.wikimedia.org/T288303 (JArguello-WMF) a:BPirkle→codebug
[19:00:41] Project wikimedia-event-utilities-maven-release-docker build #22: FAILURE in 2 min 45 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/22/
[19:01:44] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Media Analytics Service - https://phabricator.wikimedia.org/T288303 (JArguello-WMF) a:codebug→BPirkle
[19:03:02] Analytics-Radar, Data-Engineering-Icebox, ChangeProp, Community-Tech, and 3 others: RFC: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (SD0001) Apache ActiveMQ could be a solution here - it allows enqueueing messages that...
[19:13:03] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (dduvall) In the meantime, there are some issues with the current approach that can be ironed out. Who knows, they may have something to do with the hang as well....
[19:18:00] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Media Analytics Service - https://phabricator.wikimedia.org/T288303 (JArguello-WMF) a:BPirkle→codebug
[19:30:40] Data-Engineering, Event-Platform Value Stream: eventutilities-python: http event process function should report latency. - https://phabricator.wikimedia.org/T338380 (gmodena)
[19:39:14] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (JArguello-WMF)
[19:39:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:44:12] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) >>! In T309699#8903606, @Ottomata wrote: > @gmodena what shoul...
[19:44:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:51:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[20:07:53] Data-Engineering, AQS2.0, API Platform (AQS 2.0 Roadmap), Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (JArguello-WMF)
[20:08:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[20:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[21:11:54] (CR) Gergő Tisza: [C: +2] Add section title and ordinal in the image_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/924879 (https://phabricator.wikimedia.org/T335716) (owner: Sergio Gimeno)
[21:13:19] (Merged) jenkins-bot: Add section title and ordinal in the image_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/924879 (https://phabricator.wikimedia.org/T335716) (owner: Sergio Gimeno)
[23:56:44] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed