[01:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:45] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:47:15] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) OK this thread has bugged and annoyed me to sufficiently motivate me to attempt to use Superset. After looking at the user interface...
[04:44:51] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (EpicPupper) > Then provide it with a lot of support, duh. I’m not sure that you fully understand the burden of maintaining a tool.
[05:16:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:52] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has been disabled for 604926 seconds, message: Journal node is about to be decommissioned thus, swap the journal node with another -T338336 - {USER} - stevemunene, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:39:59] (PuppetDisabled) firing: Puppet disabled on analytics1069:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=analytics&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:22:49] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (elukey) Hi folks! Would it be possible to have either stat1005 or stat1008 (the ones with GPUs) on bookworm? I am asking since we'd have a place wit...
[07:39:53] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (dcausse) >>! In T309699#8903561, @Ottomata wrote:...
[08:25:08] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (MoritzMuehlenhoff) If git-fat will be needed on an ongoing basis (also for Bookworm and later, which no longer have any Python 2 at all), it needs to be ported to Python 3, see the older discussio...
[09:05:45] !log move varnishkafka instances in ulsfo to PKI
[09:05:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:14:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (dcausse) >>! In T309699#8933672, @dcausse wrote:...
[09:16:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:26] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene)
[09:18:47] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene)
[09:19:03] Data-Engineering-Planning, Data-Platform-SRE: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene)
[09:19:35] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (BTullis) >>! In T329360#8933658, @elukey wrote: > Hi folks! Would it be possible to have either stat1005 or stat1008 (the ones with GPUs) on bookwor...
[09:25:42] Data-Engineering-Planning, Data-Platform-SRE, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (elukey) @BTullis sorry I completely forgot about the hadoop packages, let's not jump to bookworm yet, you folks have enough work on your plate, we'l...
[09:36:29] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (IKhitron) > Quarry needs people working on it. For close to two years now that has been mostly me. Lacking greater community support, it is cl...
[09:39:19] all vk ulsfo instances running pki!
[09:54:56] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) It's now at 79%, I compress another table and then call it done.
[09:58:41] Excellent, thanks elukey
[10:26:46] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) I noticed that there were some packages incorrectly added for i386 in hadoop, so I removed them:...
[10:28:06] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) > For badrevids I think that you should...
[10:32:30] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) Removed existing bullseye packages with: ` for p in $(sudo -i reprepro -C thirdparty/bigtop15 -A...
[10:39:59] (PuppetDisabled) firing: Puppet disabled on analytics1069:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=analytics&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[11:15:08] (PS2) Aqu: [WIP] Load both high and low resolution map at the same time [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (owner: Milimetric)
[11:15:09] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) > I analyzed "badrevids" events from the...
[11:25:03] (PS3) Aqu: [WIP] Load both high and low resolution map at the same time [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (owner: Milimetric)
[11:39:52] milimetric: for when you're ready: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/431
[12:04:20] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 63 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (TheresNoTime)
[12:15:49] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (dcausse) >>! In T309699#8934372, @gmodena wrote:...
[12:16:46] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou) We now can see some traffic hitting the outlink model server on LiftWing! https://grafana.wikimedia.org/d/zsdYRV7Vk/isti...
[12:27:32] FYI, I'm installing container security updates on dse-* (no impact on running pods and already in use by e.g. the ML cluster)
[12:34:21] !log Deploy analytics-airflow to patch mediawiki_history_reduced druid loading
[12:34:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:36:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:00] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team, Patch-For-Review: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou)
[12:41:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) https://www.mediawiki.org/wiki/Manual:M...
[12:47:20] !log roll running sre.hadoop.roll-restart-masters to completely remove any reference of analytics1058-1060 for T317861
[12:47:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:47:23] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[12:53:44] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) Am I correct in my assumption that Quarry is 100% "supported" by volunteers? Who wrote it? Ah, looking at [[ https://quarry.wmcloud....
[13:16:02] PROBLEM - HDFS topology check on an-master1001 is CRITICAL: CRITICAL: There is at least one node in the default rack. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[13:26:03] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) >>! In T169452#8933134, @rook wrote: > If superset were to completely fall over, project was removed or some such, https://github.com...
[13:40:58] Data-Engineering, Data-Platform-SRE, SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) Next steps: * Roll out the changes to eqsin, and monitor. * Roll out the changes to codfw, and monitor. * Roll out the changes to eqiad, and monitor. * Roll out the ch...
[14:00:40] (PS1) Joal: Add explicit snapshot to HiveToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930636
[14:00:45] mforns: --^
[14:01:12] milimetric: if you have a minute --^
[14:04:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:06:05] (CR) Milimetric: [C: +2] "looks good, but how was this working before?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930636 (owner: Joal)
[14:06:46] milimetric: wanna batcave quickly?
[14:06:58] in joal
[14:09:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:12:15] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Bawolff) > Where is YuviPanda? Have they abandoned us? Yuvi is a former WMF employee. He has not worked for WMF for quite some time. Regardl...
[14:14:49] (Merged) jenkins-bot: Add explicit snapshot to HiveToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930636 (owner: Joal)
[14:14:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:39:59] (PuppetDisabled) firing: Puppet disabled on analytics1069:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=analytics&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:43:12] milimetric, mforns : https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/432
[14:43:21] lookin
[14:44:08] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) >>! In T169452#8935066, @Bawolff wrote: >> Where is YuviPanda? Have they abandoned us? > > Yuvi is a former WMF employee. He has not...
[14:49:32] milimetric: shall I start releasing refinery-source v0.2.17?
[14:49:44] I can deploy that
[14:50:21] ack
[14:52:18] (PS1) Milimetric: Update changelog for v0.2.17 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930646
[14:52:32] (CR) Milimetric: [C: +2] Update changelog for v0.2.17 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930646 (owner: Milimetric)
[14:58:21] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) >>! In T169452#8935227, @Wbm1058 wrote: > So Quarry was written by a Wikimedia Foundation employee as a task that the Foundation assigne...
[15:03:31] (Merged) jenkins-bot: Update changelog for v0.2.17 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930646 (owner: Milimetric)
[15:03:56] Starting build #122 for job analytics-refinery-maven-release-docker
[15:17:26] Project analytics-refinery-maven-release-docker build #122: SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/122/
[15:23:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:45] (SystemdUnitFailed) firing: (4) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:31:05] Starting build #81 for job analytics-refinery-update-jars-docker
[15:31:24] (PS1) Maven-release-user: Add refinery-source jars for v0.2.17 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/930588
[15:31:33] Project analytics-refinery-update-jars-docker build #81: SUCCESS in 27 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/81/
[15:31:45] (SystemdUnitFailed) firing: (5) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:40] Data-Engineering, DBA, Data-Services, TaxonBot, and 3 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Jelto) fyi: I started an incident doc at https://wikitech.wikimedia.org/wiki/Incidents/2023-05-28_wikireplicas_lag because it was requested to have this incident in t...
[15:35:37] (CR) Milimetric: [V: +2 C: +2] Add refinery-source jars for v0.2.17 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/930588 (owner: Maven-release-user)
[15:36:45] (SystemdUnitFailed) firing: (6) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:04] joal: refinery-source deployed, refinery updated and synced to hdfs
[15:53:14] !log refinery-source 0.2.17 deployed, refinery updated and synced to hdfs
[15:53:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:21] awesome - I think you've merged the airflow code, right?
[15:54:42] I auto merged it, lemme check it's done
[15:54:49] It's done
[15:54:50] yep, merged
[15:54:55] I'm gonna deploy airflow
[15:55:34] cool, yeah, restart that job (remember the variables) and let's see
[15:55:41] Yup!
[15:56:26] !log Deploy airflow to fix druid loading jobs using snapshot
[15:56:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:58:55] !log Rerun druid indexation for mediawiki_history_reduced
[15:58:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:10:43] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) >>! In T169452#8935287, @rook wrote: >>>! In T169452#8933524, @Wbm1058 wrote: >> ... I drop my SQL >> >> select left(page_links_upda...
[16:19:15] (CR) Mforns: [C: +2] Add explicit snapshot to HiveToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/930636 (owner: Joal)
[16:19:27] post merge +2!
[16:19:34] Data-Engineering, DBA, Data-Services, TaxonBot, and 2 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) You might want to sync up with @KOfori because he's also started one IR. And I have captured a lot more detailed timeline, so maybe we need to merge both.
[16:19:35] \o/
[16:19:49] The job failed - looking into it mforns, milimetric
[16:23:29] OOM issue
[16:23:34] Will bump resources
[16:26:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:29:14] (PS1) DCausse: mediawiki/revision/create: add mandatory dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/930666 (https://phabricator.wikimedia.org/T267648)
[16:31:31] milimetric, mforns: another one sorry: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/433
[16:31:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:36:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:21] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) Ah, it was going so well. I tried to downgrade the packages on an-test-worker1001, but there are...
[16:51:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:15] ping again milimetric or mforns ?
[17:02:28] !log Deploying airflow (again) to fix memory issues
[17:02:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:04:42] joal: here!
[17:05:10] nevermind mforns - I self-merged and deployed, but a second trial of the job actually succeeded
[17:05:20] sorry for the ping
[17:05:21] oh, np
[17:05:30] great !
[17:20:40] (CR) DCausse: "unsure if this is reasonable to introduce this change for this stream" [schemas/event/primary] - https://gerrit.wikimedia.org/r/930666 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[17:21:38] ottomata: heya - I've quickly written --^ for Sandra - Would you mind checking with her if it's ok or needs changes? I'll be gone soon
[17:22:23] (PS2) DCausse: mediawiki/revision/create: add mandatory dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/930666 (https://phabricator.wikimedia.org/T267648)
[17:23:01] (PS3) DCausse: mediawiki/revision/create: add mandatory dt field [schemas/event/primary] - https://gerrit.wikimedia.org/r/930666 (https://phabricator.wikimedia.org/T267648)
[17:26:19] mforns, milimetric - druid is indexing (see https://yarn.wikimedia.org/proxy/application_1686833367123_0562/)
[17:26:35] lookin
[17:26:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:27:20] I'm to leave now, so I'll let you monitor that one - You can check when it's done that the datasource is present on the druid-public cluster, with more segments than the previous snapshot :)
[17:27:32] Once we have that, it's safe to deploy AQS with the new snapshot
[17:31:36] sweet
[17:31:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:33:13] cool joal
[17:36:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:50:04] Data-Engineering, DBA: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 (Ladsgroup) Open→Resolved There isn't much we can clean up right now, maybe more aggressive binlog cleaning but at the end, next month or so we will drop the extlinks old columns and that will give us a lot of...
[18:26:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:59] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_05 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[19:03:11] milimetric: the druid loading finished, and the overall status is success. However, I saw that there were some failed and killed map as well as reduce tasks...
[19:03:58] hm... that can happen with bigger jobs. Looking at segments now
[19:06:43] k
[19:07:38] not sure if you have a fancier way of looking, but the segments all look fresh and there's enough of them, (4,304 segments) vs. (4,288 segments) for the previous datasource
[19:07:59] I'd say we're good to deploy whenever an SRE can get to it
[19:09:44] That's 16 more than the previous month
[19:10:21] makes sense no?
[19:10:39] yes: "num_shards": 16,
[19:11:12] +1
[19:12:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_05 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[19:22:01] I am here. What can I do?
[19:23:57] milimetric: Is it time for this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/930620 Or is it something else?
[19:24:47] btullis: yep, you can do that
[19:25:02] I had reverted my revert, https://gerrit.wikimedia.org/r/c/operations/puppet/+/930543
[19:25:07] but same thing
[19:25:17] should actually work now :)
[19:25:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) > I did a quick analysis a couple years...
[19:27:52] !log restarting aqs service on A:aqs in batches of 2, 10 seconds apart
[19:27:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:30:42] OK, aqs service all restarted.
[19:31:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:09] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (gmodena) >>! In T309699#8934671, @Ottomata wrote:...
[19:36:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:51:56] I was looking through wikistats and the new data wasn't showing up, but it's caching. I logged into the AQS machines and verified all the new data looks good. But we might be caching more aggressively than before and I'm not sure why
[19:56:57] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (dcausse) >>! In T309699#8936561, @gmodena wrote:...
[19:57:41] (PS4) Milimetric: Increase world map resolution [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (https://phabricator.wikimedia.org/T338033)
[20:01:32] (CR) Milimetric: "ok, @aqu, this is ready for your review. All the other hacky stuff I was doing before was just to try to play with the two different reso" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (https://phabricator.wikimedia.org/T338033) (owner: Milimetric)
[20:06:45] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (Ottomata) > I'm not sure it might help unless we...
[20:15:21] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Wbm1058) I'm making more sense of this, figuring out things that should have been explained to us from the start of this phab. "Superset" is...
[20:17:05] (CR) Aqu: [C: +2] "LGTM" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (https://phabricator.wikimedia.org/T338033) (owner: Milimetric)
[20:18:24] (Merged) jenkins-bot: Increase world map resolution [analytics/wikistats2] - https://gerrit.wikimedia.org/r/929816 (https://phabricator.wikimedia.org/T338033) (owner: Milimetric)
[20:19:11] hey! you go to sleep!
[20:19:20] :)
[20:21:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:58] (CR) Milimetric: [C: +1] "Looks good, just not sure about the monthly filter." [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033) (owner: Aqu)
[20:26:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:31:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:58] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B): Fix wikimedia-event-utilities Guava dependencies issues - https://phabricator.wikimedia.org/T337421 (Ottomata) @Snwachukwu @dcausse What's the status?!
[20:59:31] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Stuartyeates) I'm aware that our superset install doesn't currently do caching of resultsets, but there is documentation in the superset docs...
[21:03:08] Data-Engineering-Planning, DC-Ops, Data-Platform-SRE, SRE, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm) [21:06:45] (SystemdUnitFailed) firing: (7) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:32] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (Stuartyeates) >>! In T169452#8936664, @Wbm1058 wrote: > I'm making more sense of this, figuring out things that should have been explained to... [21:11:45] (SystemdUnitFailed) firing: (9) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:45] (SystemdUnitFailed) firing: (9) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:59] (SystemdUnitFailed) firing: (9) refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:45] (SystemdUnitFailed) firing: (7) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:45] 
(SystemdUnitFailed) firing: (7) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:37] Thanks a lot milimetric mforns and btullis for handling the AQS restart <3 [23:06:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:33] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14), Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (nshahquinn-wmf) >>! In T338033#8928153, @Antoine_Quhen wrote: > I'm proposing with those patches: > * ht...