[00:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:21:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:31:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:58:45] 10Data-Engineering-Planning, 10Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [02:00:52] 10Data-Engineering-Planning, 10Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) [04:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [05:31:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:34] !log kill remaining processes for `andyrussg` on stat100x nodes to unblock 
puppet [06:10:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:42:58] !log stop hadoop-hdfs-journalnode on analytics1069 in order to swap the journal node with an-worker1142 T338336 [06:43:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:43:01] T338336: Swap an existing Journal node analytics1069 with an-worker1142 - https://phabricator.wikimedia.org/T338336 [06:46:27] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:29] PROBLEM - Hadoop JournalNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [06:46:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:40] Hi elukey btullis How long does a typical decommissioning of a namenode take? 
noticed that analytics1058 hasn't moved much from yesterday https://usercontent.irccloud-cdn.com/file/hHL2IQND/image.png [07:06:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:43] stevemunene: o/ it takes a bit of time since the blocks needs to be migrated to other nodes first, there is a graph in grafana for that [07:11:47] unreplicated blocks [07:12:30] underreplicated sorry [07:12:30] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41 [07:12:41] once --^ is done the decom should be completed IIRC [07:13:17] I'd say in an hour or two it should be completed [07:16:25] Great, thanks elukey [07:21:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:03] 10Data-Engineering, 10Data Pipelines, 10Continuous-Integration-Config, 10Event-Platform Value Stream (Sprint 14 B), and 2 others: Wikimedia-event-utilities jenkins build failure - https://phabricator.wikimedia.org/T338343 (10hashar) The root cause is those java8 CI images are still based on Debian Stretch... 
[08:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:29:37] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena)
[08:54:46] Hi stevemunene - elukey is right, it's not the //capacity// value that we're watching, but the underreplicated blocks. I made a note about this yesterday here: https://phabricator.wikimedia.org/T317861#8908799
[08:56:06] I'd initially expected the data to physically move from the worker node as it is being decommissioned, but it turns out that it is just //copied// elsewhere and left in place.
[09:09:49] btw the node should be done, no more under replicated blocks
[09:28:13] stevemunene: Do you want to pair on finishing up the journal move?
[09:31:23] btullis: currently in a session with gehel maybe in the next 30?
[09:31:50] btullis: if you want to join us, we're having a look at how to deal with strings of patches
[09:31:55] https://meet.google.com/czw-cvbw-gqh
[09:32:32] gehel: Thanks. I'll just grab a coffee and then join in.
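The check discussed above (watch the underreplicated block count rather than the capacity value) can be sketched in Python. This is a minimal sketch under stated assumptions, not the team's tooling: it assumes the NameNode's `/jmx` JSON output exposes an `UnderReplicatedBlocks` attribute on the `Hadoop:service=NameNode,name=FSNamesystem` bean, which should be verified against the Hadoop version in use. The function names and the sample document are hypothetical, and the sample values are made up for illustration.

```python
import json
from typing import Sequence

# Assumed bean and attribute names; verify against the running Hadoop version.
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"
ATTRIBUTE = "UnderReplicatedBlocks"


def underreplicated_blocks(jmx_json: str) -> int:
    """Extract the underreplicated block count from a /jmx JSON document."""
    for bean in json.loads(jmx_json).get("beans", []):
        if bean.get("name") == FSNAMESYSTEM_BEAN:
            return int(bean[ATTRIBUTE])
    raise KeyError("bean not found: " + FSNAMESYSTEM_BEAN)


def decommission_looks_done(samples: Sequence[int]) -> bool:
    """True when the most recent sample shows no underreplicated blocks."""
    return len(samples) > 0 and samples[-1] == 0


# Hypothetical sample document; the numbers are illustrative only.
SAMPLE_JMX = json.dumps(
    {"beans": [{"name": FSNAMESYSTEM_BEAN, ATTRIBUTE: 0, "CapacityUsed": 123}]}
)

if __name__ == "__main__":
    count = underreplicated_blocks(SAMPLE_JMX)
    print("underreplicated blocks:", count)
    print("decommission looks done:", decommission_looks_done([4200, 310, count]))
```

The capacity figure is deliberately ignored here, since the blocks are copied elsewhere and left in place on the decommissioning node.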
[09:48:59] (03CR) 10Kosta Harlan: Add section title and ordinal in image suggestions submission events (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: 10Sergio Gimeno) [10:43:31] (03PS3) 10Sergio Gimeno: Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) [10:44:08] (03CR) 10CI reject: [V: 04-1] Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: 10Sergio Gimeno) [11:21:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:27] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Anti-Harassment, 10CheckUser, and 7 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (10Tchanders) [11:28:27] (03PS4) 10Sergio Gimeno: Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) [11:28:47] (03CR) 10Sergio Gimeno: Add section title and ordinal in image suggestions submission events (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: 10Sergio Gimeno) [12:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:37:33] 10Data-Engineering, 10Release-Engineering-Team, 10Event-Platform Value Stream 
(Sprint 14 B), 10Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (10CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-enginee... [12:41:14] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/... [12:48:47] 10Data-Engineering, 10Event-Platform Value Stream: eventutilities-python: http event process function should report latency. - https://phabricator.wikimedia.org/T338380 (10Ottomata) This might be tricky because the user's enrich function does not have access to the Flink context, which is used to emit custom m... [13:11:20] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10JArguello-WMF) [13:17:41] mforns: Good afternoon - Would you have some time for me to talk about netflow druid data? [14:25:55] Starting build #23 for job wikimedia-event-utilities-maven-release-docker [14:29:26] Yippee, build fixed! [14:29:26] Project wikimedia-event-utilities-maven-release-docker build #23: 09FIXED in 3 min 32 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/23/ [14:34:56] 10Quarry, 10cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (10Andrew) >>! In T302154#8884891, @Framawiki wrote: >>>! In T302154#7724373, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-cloud), href=https://sal.toolforge.... [14:35:11] elukey ooops thanks much! 
[14:35:38] (CR) Snwachukwu: [C: +1] "recheck" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/923646 (https://phabricator.wikimedia.org/T337421) (owner: DCausse)
[14:37:27] (PS3) DCausse: Use eventutilites shaded jar [analytics/refinery/source] - https://gerrit.wikimedia.org/r/923646 (https://phabricator.wikimedia.org/T337421)
[14:40:45] (CR) jenkins-bot: Use eventutilites shaded jar [analytics/refinery/source] - https://gerrit.wikimedia.org/r/923646 (https://phabricator.wikimedia.org/T337421) (owner: DCausse)
[14:49:41] Analytics, AQS2.0, API Platform (AQS 2.0 Roadmap), Documentation, and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (Atieno)
[14:55:25] (CR) Snwachukwu: [C: +2] Use eventutilites shaded jar [analytics/refinery/source] - https://gerrit.wikimedia.org/r/923646 (https://phabricator.wikimedia.org/T337421) (owner: DCausse)
[15:03:14] (Merged) jenkins-bot: Use eventutilites shaded jar [analytics/refinery/source] - https://gerrit.wikimedia.org/r/923646 (https://phabricator.wikimedia.org/T337421) (owner: DCausse)
[15:10:41] Data-Engineering, Data-Engineering-Wikistats: Wikistats returns a blank page when switching language to Icelandic - https://phabricator.wikimedia.org/T338466 (jhsoby)
[15:21:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:44] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/...
[15:31:27] (03Abandoned) 10Nmaphophe: GDI Equity Landscape Tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/924131 (owner: 10Nmaphophe) [15:47:47] 10Data-Engineering-Planning, 10Data-Platform-SRE: Rebuild hive-hcatalog package for bullseye to address missing symlinks - https://phabricator.wikimedia.org/T337465 (10BTullis) OK, this is looking better now. I've added a patch to the hive component that allows us to use archiva's mirror for the missing jar fi... [15:58:31] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 14): Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (10Antoine_Quhen) a:03Antoine_Quhen [16:01:45] Hi mforns - I pinged you earlier on - will you have some time after standup? [16:02:09] oh, joal, didn't see it... sorry. Yes, ofc. let's meet [16:02:37] thank you :) [16:17:48] stevemunene: o/ re https://gerrit.wikimedia.org/r/c/operations/puppet/+/928349 - let's not leave the cluster into maintenance mode for too long, we have 5 journal nodes so not a big risk but we should try to not leave in-progress migrations if possible (say that somebody has to check later on because there is a problem etc..) [16:18:53] This should be ready to merge elukey added a comment on the ticket [16:19:29] (03PS13) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:20:08] stevemunene: nice! +1ed the change [16:20:13] (03CR) 10CI reject: [V: 04-1] Encode redirect targets in page change events. 
[schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:20:33] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JArguello-WMF) [16:25:39] (03PS14) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:26:07] (03CR) 10CI reject: [V: 04-1] Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:31:33] (03PS15) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:34:14] (03CR) 10Ottomata: "I brought back entity/page_link after all." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:35:49] (03CR) 10Ottomata: Encode redirect targets in page change events. (032 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:37:24] (03CR) 10Peter Fischer: Encode redirect targets in page change events. (033 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:39:11] stevemunene: Do you want to pair on the journalnode now, or is it a bit late where you are? I agree with elukey that it would be good to get it back up to 5 journalnodes sooner rather than later, for resilience. [16:42:26] (03CR) 10Peter Fischer: "LGTM! I like how clean it has become." 
[schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:43:06] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 14): Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (10Milimetric) This [[ https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/geoeditors/geo... [16:43:48] 10Data-Engineering-Planning, 10Data-Platform-SRE: Rebuild hive-hcatalog package for bullseye to address missing symlinks - https://phabricator.wikimedia.org/T337465 (10BTullis) Confirmed, I built the package again with the patch to change the suffix in the name of the package and the symlinks are indeed missin... [16:48:45] (03CR) 10Ottomata: Encode redirect targets in page change events. (034 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:49:45] (03PS16) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:50:12] (03CR) 10CI reject: [V: 04-1] Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:51:06] btullis: I am available to pair [16:52:46] (03CR) 10Peter Fischer: Encode redirect targets in page change events. 
(031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [16:54:03] 10Analytics, 10AQS2.0, 10API Platform (AQS 2.0 Roadmap), 10Documentation, and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10JArguello-WMF) [16:54:07] 10Quarry, 10cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (10Andrew) Ah, sorry, I should've read back further in the task! Yes, that host can+should be deleted. [16:55:40] stevemunene: Let's go for it! [16:56:21] I'm in our daily sync room. [17:01:43] !log running puppet on an-worker1142 to start the new journalnode [17:01:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:44] !log running the sre.hadoop.roll-restart-masters cookbook for the analytics cluster, to pick up the new journalnode for T338336 [17:12:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:46] T338336: Swap an existing Journal node analytics1069 with an-worker1142 - https://phabricator.wikimedia.org/T338336 [17:20:42] This is the current dfshealth on an-master1002, you can see that the IP address for analytics1069 is still present. [17:20:44] https://usercontent.irccloud-cdn.com/file/83gzYXm2/image.png [17:22:07] The cookbook is currently waiting 10 minutes after the startup of the namenode on an-master1001, before it asks to fail back. We should check that the IP address for an-worker1142 is shown instead. [17:30:24] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou) >> precise names of the fields in the data (we can look for this in realtime in the data when it starts flowing) > Sure,... 
[17:32:10] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 14): Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (10mpopov) > However, we should also productionize this to be in the refinery repo and maintained centrally so we don't rely on H... [17:55:48] (03PS17) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [17:56:16] (03CR) 10Ottomata: Encode redirect targets in page change events. (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [17:56:18] (03CR) 10CI reject: [V: 04-1] Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [17:57:29] (03PS18) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [18:01:43] (03PS19) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [18:03:48] (03CR) 10Peter Fischer: "I think we have the right amount of information in the right places now." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [18:04:20] (03PS20) 10Ottomata: Encode redirect targets in page change events. [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [18:09:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1002:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [18:14:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [18:34:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [18:39:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [19:21:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:02:32] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/... [20:04:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [20:19:44] (SystemdUnitCrashLoop) firing: crashloop on an-test-client1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:22:36] (03CR) 10Gergő Tisza: [C: 03+2] Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: 10Sergio Gimeno) [20:23:13] (03Merged) 10jenkins-bot: Add section title and ordinal in image suggestions submission events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/926468 (https://phabricator.wikimedia.org/T335716) (owner: 10Sergio Gimeno) [20:23:24] PROBLEM - aqs endpoints health on aqs2001 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia 
page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [20:26:03] 10Analytics, 10AQS2.0, 10Tech-Docs-Team, 10API Platform (AQS 2.0 Roadmap), and 5 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [20:36:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:01:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:51] 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Patch-For-Review: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata) Alright, in the latest patch for including redirect page link... 
[21:31:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:36:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:06:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:58] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed