[01:12:09] Analytics-Radar, Logging, SRE, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[02:41:36] Analytics, Logging: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (lmata)
[02:51:22] Analytics-Radar, Logging, SRE, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[08:35:17] Analytics, Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (ArielGlenn) Open→Resolved a:ArielGlenn This is now resolved. But I'm going to add a note to the docs for switching dumpsdata hosts, so we don't have a repeat.
[10:25:27] !log btullis@an-druid1003:~$ sudo puppet agent -tv
[10:25:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:03] Brining an-druid1003.eqiad.wmnet into service from 'insetup'
[10:26:28] ...Bringing.
[10:32:12] Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) The ownership of the /var/log/druid directory seems not to have been set correctly during the first puppet run. ` btullis@an-druid1003:/var/log/druid$ ls -ld /var/log/druid/ dr...
[10:45:24] Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) It's the same with `/srv/druid` ` btullis@an-druid1003:/var/log/druid$ ls -ld /srv/druid drwxr-xr-x 7 zookeeper 120 4096 Aug 9 10:25 /srv/druid ` The script `/var/lib/dpkg/in...
[10:45:54] !log btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
[10:45:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:57:35] a-team gerrit access request: https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/709646
[10:58:33] (CR) Urbanecm: [C: +1] "LGTM" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/710414 (https://phabricator.wikimedia.org/T287926) (owner: Gergő Tisza)
[11:27:45] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) All Icinga checks healthy for an-druid1003. The coordinator is showing a gradually increasing usage, which looks fine. {F34588518} I've created a patch to...
[11:30:47] this will allow me to merge my team's new reportupdater aggregation query, for T287578.
[11:30:48] T287578: Generate wikitext_2010 edit sessions for normalization in TemplateWizard - https://phabricator.wikimedia.org/T287578
[11:31:13] (PS2) Awight: Review access change [analytics/reportupdater-queries] (refs/meta/config) - https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578)
[11:38:21] CR+2 access was already given in https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/708633 so this is just a formality
[11:38:57] (repo does not yet have a self-merge job triggered by CR)
[12:25:15] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) The patch to create the user seems to have worked at first pass. ` Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[druid]/ensure: created Notice: /S...
[12:28:38] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Ah, no. It didn't work. They're owned by root, so the postinst script didn't set the ownership. ` btullis@an-druid1004:~$ ls -ld /var/log/druid drwxr-xr-...
[12:31:31] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Changed the ownership and restarted the services manually. ` btullis@an-druid1004:~$ sudo chown -R druid:druid /var/log/druid /srv/druid btullis@an-druid10...
[17:13:34] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Bringing an-druid1005 into service now, with the latest change to the installation of druid. ` Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[drui...
[17:43:10] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) All looks to be OK with the segment rebalance. {F34588731} I will leave this overnight, then continue with the removal of the druid100[1-3] nodes tomorrow...
[18:08:14] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:22:04] ^ I'm looking into this Icinga failure. I don't *think* that it's related to the druid refresh, but can't rule it out. The error from journalctl is this: `Aug 09 18:01:49 an-launcher1002 eventlogging_to_druid_prefupdate_hourly[9293]: 21/08/09 18:01:49 ERROR DataFrameToDruid: Druid ingestion task index_hadoop_event_prefupdate_ohkbglbb_2021-08-09T18:00:39.359Z for event_prefupdate failed.`
[18:34:07] https://usercontent.irccloud-cdn.com/file/BNiDN469/image.png
[18:36:25] I don't yet know whether I need to re-run this or whether the next hourly job (in 24 minutes) should deal with the missing data appropriately.
[19:02:34] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:09:02] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:03:22] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:05:50] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:00:14] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:04:02] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers