[01:12:09] <wikibugs>	 Analytics-Radar, Logging, SRE, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[02:41:36] <wikibugs>	 Analytics, Logging: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (lmata)
[02:51:22] <wikibugs>	 Analytics-Radar, Logging, SRE, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[08:35:17] <wikibugs>	 Analytics, Dumps-Generation: xmldatadumps dumpstatus.json files only readable by root - https://phabricator.wikimedia.org/T287989 (ArielGlenn) Open→Resolved a:ArielGlenn This is now resolved. But I'm going to add a note to the docs for switching dumpsdata hosts, so we don't have a repeat.
[10:25:27] <btullis_>	 !log btullis@an-druid1003:~$ sudo puppet agent -tv
[10:25:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:26:03] <btullis_>	 Brining an-druid1003.eqiad.wmnet into service from 'insetup'
[10:26:28] <btullis_>	 ...Bringing.
[10:32:12] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) The ownership of the /var/log/druid directory seems not to have been set correctly during the first puppet run.  ` btullis@an-druid1003:/var/log/druid$ ls -ld /var/log/druid/ dr...
[10:45:24] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) It's the same with `/srv/druid`  ` btullis@an-druid1003:/var/log/druid$ ls -ld /srv/druid drwxr-xr-x 7 zookeeper 120 4096 Aug  9 10:25 /srv/druid `  The script `/var/lib/dpkg/in...
[10:45:54] <btullis_>	 !log btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
[10:45:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
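Editorial note: the ownership problem reported for `/var/log/druid` and `/srv/druid` can be checked before and after the `chown`. What follows is a minimal Python sketch, not something that was run in the channel. The helper names `owner_of` and `misowned` are hypothetical, and the expected owner `druid:druid` is taken from the `chown` command logged above. Ids that do not resolve to a name fall back to their numeric form, which mirrors the numeric group `120` in the `ls -ld` output.

```python
import grp
import os
import pwd


def owner_of(path):
    """Return (user, group) for path; ids without a name are returned as digit strings."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group


def misowned(paths, user="druid", group="druid"):
    """Return {path: (user, group)} for every path whose owner differs from the expected one.

    The default owner is an assumption based on the chown command in the log.
    Only the top-level entries are checked; this does not recurse like chown -R.
    """
    wrong = {}
    for path in paths:
        actual = owner_of(path)
        if actual != (user, group):
            wrong[path] = actual
    return wrong
```

On a Druid host, `misowned(["/srv/druid", "/var/log/druid"])` would be expected to return an empty dict once the ownership has been corrected.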
[10:57:35] <awight>	 a-team gerrit access request: https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/709646
[10:58:33] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/710414 (https://phabricator.wikimedia.org/T287926) (owner: Gergő Tisza)
[11:27:45] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) All Icinga checks healthy for an-druid1003. The coordinator is showing a gradually increasing usage, which looks fine. {F34588518}  I've created a patch to...
[11:30:47] <awight>	 this will allow me to merge my team's new reportupdater aggregation query, for T287578.
[11:30:48] <stashbot>	 T287578: Generate wikitext_2010 edit sessions for normalization in TemplateWizard - https://phabricator.wikimedia.org/T287578
[11:31:13] <wikibugs>	 (PS2) Awight: Review access change [analytics/reportupdater-queries] (refs/meta/config) - https://gerrit.wikimedia.org/r/709646 (https://phabricator.wikimedia.org/T287578)
[11:38:21] <awight>	 CR+2 access was already given in https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/708633 so this is just a formality
[11:38:57] <awight>	 (repo does not yet have a self-merge job triggered by CR)
[12:25:15] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) The patch to create the user seems to have worked at first pass. ` Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[druid]/ensure: created Notice: /S...
[12:28:38] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Ah, no. It didn't work. They're owned by root, so the postinst script didn't set the ownership.   ` btullis@an-druid1004:~$ ls -ld /var/log/druid drwxr-xr-...
[12:31:31] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Changed the ownership and restarted the services manually. ` btullis@an-druid1004:~$ sudo chown -R druid:druid /var/log/druid /srv/druid btullis@an-druid10...
[17:13:34] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) Bringing an-druid1005 into service now, with the latest change to the installation of druid.  ` Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[drui...
[17:43:10] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (BTullis) All looks to be OK with the segment rebalance. {F34588731} I will leave this overnight, then continue with the removal of the druid100[1-3] nodes tomorrow...
[18:08:14] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:22:04] <btullis_>	 ^ I'm looking into this Icinga failure. I don't *think* that it's related to the druid refresh, but can't rule it out. The error from journalctl is this: `Aug 09 18:01:49 an-launcher1002 eventlogging_to_druid_prefupdate_hourly[9293]: 21/08/09 18:01:49 ERROR DataFrameToDruid: Druid ingestion task index_hadoop_event_prefupdate_ohkbglbb_2021-08-09T18:00:39.359Z for event_prefupdate failed.`
[18:34:07] <btullis_>	 https://usercontent.irccloud-cdn.com/file/BNiDN469/image.png
[18:36:25] <btullis_>	 I don't yet know whether I need to re-run this or whether the next hourly job (in 24 minutes) should deal with the missing data appropriately.
[19:02:34] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:09:02] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:03:22] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:05:50] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:00:14] <icinga-wm>	 RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:04:02] <icinga-wm>	 PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
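Editorial note: the icinga-wm lines above alternate between PROBLEM and RECOVERY for the same systemd units, so they can be paired to see how long each alert stayed open. The following is a rough Python sketch under stated assumptions: the regular expression matches only the "Check unit status of ... on ..." wording seen in this log, all timestamps are assumed to fall on the same day, and the function name `alert_windows` is hypothetical.

```python
import re
from datetime import datetime

# Matches only the icinga-wm "Check unit status" lines as they appear in this log.
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\]\s+<icinga-wm>\s+(PROBLEM|RECOVERY) - "
    r"Check unit status of (\S+) on (\S+) is"
)


def alert_windows(lines):
    """Pair PROBLEM and RECOVERY lines per (unit, host).

    Returns (closed, still_open): closed is a list of
    (unit, host, seconds_open) tuples in recovery order, and still_open
    is a sorted list of (unit, host) keys with no recovery yet.
    Assumes every timestamp belongs to the same day.
    """
    open_since = {}
    closed = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an icinga-wm unit-status line
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        kind, unit, host = match.group(2), match.group(3), match.group(4)
        key = (unit, host)
        if kind == "PROBLEM":
            # Keep the earliest PROBLEM if the alert is repeated.
            open_since.setdefault(key, stamp)
        elif key in open_since:
            started = open_since.pop(key)
            closed.append((unit, host, int((stamp - started).total_seconds())))
    return closed, sorted(open_since)
```

Applied to the lines above, this pairs three alerts, open for 3260, 3260 and 3264 seconds respectively, and leaves the 21:04:02 PROBLEM for eventlogging_to_druid_prefupdate_hourly unresolved.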