[00:33:45] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:40:51] (03PS25) 10AGueyte: WIP: Basic ipinfo instrument setup [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[06:35:39] 10Analytics: Upgrade dbstore100* hosts to Bullseye - https://phabricator.wikimedia.org/T299481 (10Marostegui)
[08:21:19] very interesting https://hop.apache.org/manual/latest/getting-started/hop-what-is-hop.html
[08:21:23] (new top level project)
[08:23:22] elukey: thanks for posting :)
[11:12:33] Hi, I'm trying to figure out an issue with the access statistics of dumps.wikimedia.org, see https://phabricator.wikimedia.org/T292621 and https://phabricator.wikimedia.org/T299358 -- somehow the access logs are incomplete
[11:13:00] (not sure if that is fundamentally an a-team or an ops team issue)
[11:14:43] So far we have concluded that it is probably not a race condition between the two servers syncing their logs to stat1007
[11:15:27] MichaelG_WMDE: Hmm. I see. My first thought is that it could be something to do with this ticket: https://phabricator.wikimedia.org/T285355
[11:15:27] I /think/ that dumps used to be hosted on thorium and were moved to an-web1001 - so perhaps the log processing didn't get updated to match somehow.
[11:16:50] btullis: that looks promising! I'll have a look, thanks :)
[11:20:05] MichaelG_WMDE: I'm not so sure, now that I look at it more closely. It seems that dumps is definitely a labstore host, not thorium/an-web like I was thinking.
[11:20:10] https://www.irccloud.com/pastebin/o4Xnsi7h/
[11:20:45] 🤔
[11:22:19] yeah, that matches the output of `host dumps.wikimedia.org`
[11:26:00] https://usercontent.irccloud-cdn.com/file/QLaAApVn/image.png
[11:26:35] I wonder how long it has been like this?
[11:28:17] 😬
[11:28:40] There is no Icinga check, either for systemd as a whole or for this particular service.
[11:28:43] https://usercontent.irccloud-cdn.com/file/TaGWuEi3/image.png
[11:28:43] so it writes _some_ lines of the logs and then fails for some reason?
[11:29:08] that would explain some of the confusing data that we're seeing
[11:30:33] There are 30 days' worth of logs on disk on labstore1006. When did the incident start, as far as you are aware?
[11:30:38] https://www.irccloud.com/pastebin/k15TF1vK/
[11:31:03] May 2021 or earlier
[11:32:13] but wait: if it fails with "No such files or directory" wouldn't we expect to see _no_ logs being synced instead of truncated ones?
[11:32:43] Yes.
[11:33:03] Time to look at the secondary server, I think. See what that's doing...
[11:35:17] It's a good start. There is no `rsync_nginxlogs.service` on labstore1007.
[11:41:04] Yet we *do* have some kind of rsync connection coming in from labstore1007.
[11:41:07] https://www.irccloud.com/pastebin/imgyiZpO/
[11:42:17] Confirmed in `/var/log/syslog.1` on labstore1007
[11:42:22] `Jan 19 04:55:01 labstore1007 CRON[11754]: (root) CMD (/usr/bin/rsync -rt --perms --chmod=go+r --bwlimit=50000 /var/log/nginx/*.gz stat1007.eqiad.wmnet::dumps-webrequest/)`
[11:42:37] mh, though 32 bytes is not much
[11:43:58] oh wait, those seem to be from some previous connection
[11:44:06] No, I agree, but the total size is much bigger.
[11:44:22] Here's the crontab entry.
[11:44:28] https://www.irccloud.com/pastebin/i0fQFUEZ/
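The checks being described above (unit state, journal output, and what is actually on disk) can be reproduced roughly as follows. This is a sketch only, assuming the unit name `rsync_nginxlogs.service` mentioned in the conversation; it is not a transcript of the pastes.

    # Run on the affected host (labstore1006 in this case):
    systemctl status rsync_nginxlogs.service        # current state and result of the last run
    systemctl list-timers | grep rsync_nginxlogs    # when the timer last fired / next fires
    journalctl -u rsync_nginxlogs.service -n 50     # recent log lines, e.g. the
                                                    # "No such file or directory" error
    ls -l /var/log/nginx/*.gz                       # the files the job is expected to sync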
[11:45:12] Oh yeah, you're right. I missed that the B2 in my grep picked up some other rsync from 04:31
[11:46:06] https://www.irccloud.com/pastebin/8Q3eGMFi/
[11:47:27] could you also paste the lines from the entry right afterwards? That seems to be the corresponding rsync attempt from labstore1006
[11:48:11] I can remove the entry from root's crontab on labstore1007 and then run puppet, to make sure that it doesn't come back.
[11:48:16] https://www.irccloud.com/pastebin/eJdySg5v/
[11:48:52] gotcha, those entries make sense together with the failed systemd job
[11:50:51] yeah, that might make sense. Though I wonder if there would be some value in having the logs of both servers?
[11:50:52] It doesn't explain the "No such files or directory" though for the systemd job.
[11:51:33] yeah. What *is* in `/var/log/nginx/` on labstore1006?
[11:51:55] (I don't have sufficient credentials to connect there, I think)
[11:52:28] 30 days each of access and error logs, gzipped. I will try to re-run the service now in the foreground, to see if I can see what the error ir.
[11:52:31] is
[11:58:58] I just ran the command manually and it worked.
[11:59:13] https://www.irccloud.com/pastebin/tBuGRac0/
[11:59:26] https://www.irccloud.com/pastebin/y9GHPLpu/
[12:00:22] But when it's run as a systemd service it still says that it can't find any files named: `/var/log/nginx/*.gz`
[12:00:33] Very strange.
[12:01:15] 🤨
[12:02:18] it sounds like /var/log/nginx/*.gz might be quoted, but in the operations/puppet service definition it doesn't look like it is
[12:04:29] Yeah, I wonder whether it's something to do with that /usr/local/bin/systemd-timer-mail-wrapper fiddling with the arguments.
[12:04:58] oh wait, does rsync even expand globs?
[12:05:07] or does that need an `sh -c` wrapper to do it in the shell?
[12:05:17] because systemd certainly won't do it
[12:06:02] 10Analytics-Radar, 10Revision-Slider, 10Two-Column-Edit-Conflict-Merge, 10WMDE-TechWish, and 2 others: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111 (10thiemowmde)
[12:07:25] back when this sync was a cronjob it probably included a shell level that expanded the glob
[12:07:42] and then it got turned into a timer/service, where the command has more limited syntax
[12:07:59] IIUC, that sync runs correctly, though unintentionally, on labstore1007
[12:09:04] I think it might need single quotes around '*.gz' in the command specification.
[12:09:48] I found one other example of a glob in a systemd::timer::job type here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/instance.pp#302
[12:09:49] There are probably others, but this is the first that I found.
[12:09:49] ah, on labstore1007 it is still run by cron, so that is probably why it works there
[12:10:33] btullis: yeah, but find can expand globs by itself
[12:10:37] I'm not convinced rsync does that
[12:10:49] Yeah, rsync definitely handles globs OK, I think that it's systemd or the defined type that is stripping it out. Hang on, I'll create a CR now to check if it fixes it.
[12:11:00] "Note that the expansion of wildcards on the command-line (*.c) into a list of files is handled by the shell before it runs rsync and not by rsync itself (exactly the same as all other Posix-style programs)." (rsync(1))
[12:11:42] Thanks Lucas_WMDE - makes sense.
[12:11:45] if I run `rsync '*' glob/` locally I get "no such file or directory"
[12:19:19] Here is the patch.
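The rsync(1) behaviour quoted above is easy to confirm locally. A small sketch; the /tmp paths are illustrative only:

    # Set up a couple of dummy .gz files.
    mkdir -p /tmp/globtest/src /tmp/globtest/dst
    touch /tmp/globtest/src/a.gz /tmp/globtest/src/b.gz

    # Glob passed literally (no shell expansion), as happens when systemd runs
    # the command directly: rsync reports "No such file or directory".
    rsync -rt '/tmp/globtest/src/*.gz' /tmp/globtest/dst/

    # Glob expanded by the shell before rsync starts: this works.
    rsync -rt /tmp/globtest/src/*.gz /tmp/globtest/dst/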
[12:19:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/755347
[12:26:28] won't this still do the wrong thing?
[12:26:30] I'll let you try it out…
[12:32:41] Well, you're right. It's not working. Even after doing a `systemctl daemon-reload` the newly added quotes don't actually get added to the unit. They are present in the on-disk file, but not in the unit once it is loaded.
[12:32:45] https://www.irccloud.com/pastebin/WJvP75uM/
[12:33:28] I will revert this change.
[12:33:40] I think you need something like `sh -c 'rsync … /var/log/nginx/*.gz stat1007'` as the command
[12:33:43] or a find -exec
[12:35:12] nah, nevermind, find -exec isn't a good idea if there needs to be another argument after the {}
[12:36:30] mh, I found two more systemd rsync commands that also use a `*`. I wonder if they are also broken? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/openstack/base/keystone/fernet_keys.pp#39 and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/microsites/os_reports.pp#19
[12:36:31] Yeah, I'd prefer not to launch 60 rsync processes as well if possible, for the sake of the logs.
[12:38:00] yeah, I was hoping to use -exec + instead of \; (to only run one process) but that won't work due to the dest argument :(
[12:38:22] MichaelG_WMDE: 🤔 they might be broken indeed
[12:41:11] I'll see if I can ping people in some other channel about this. maybe in the -serviceops channel 🤔
[12:41:55] Cool. I'm breaking for lunch now, but I can carry on with this later. Thanks for bringing it up. We also need Icinga alerts on these failed timers.
[12:43:51] uploaded a change that might help https://gerrit.wikimedia.org/r/c/operations/puppet/+/755352/
[12:44:03] I'll also go for lunch soon, see you later then :)
[13:08:02] (03CR) 10Tchanders: WIP: Basic ipinfo instrument setup (033 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte)
[13:48:31] 10Data-Engineering, 10Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (10Ottomata) > this would be a stopgap until data eng is upgrading to a puppetized spark3? No, I think we would also use the anaconda-wmf spark3 installation. And use anaconda-wmf as the default python whenever we need...
[14:00:44] !log installing anaconda-wmf_2020.02~wmf6_amd64.deb on stat1004 - T292699
[14:00:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:00:48] T292699: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699
[14:16:14] milimetric: yt? i just installed a new anaconda-wmf on stat1004 with the CPPFLAGS change
[14:16:19] can you test your stuff there?
[14:57:54] ah sorry, got logged out, trying now ottomata
[14:59:23] k ty
[15:22:22] works great ottomata, thank you!
[15:23:48] 10Analytics, 10Analytics-Kanban, 10Data-Engineering-Kanban, 10wmfdata-python, 10Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (10Milimetric) @nshahquinn-wmf Andrew made the changes on stat1004 and I tested and everything seemed...
[15:24:20] ok great, i'll proceed with installing everywhere else
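Going back to the rsync timer discussed above: a minimal sketch of the `sh -c` wrapper suggested at 12:33, reusing the rsync options from the cron line pasted at 11:42. Whether the eventual fix (change 755352) takes exactly this form is not shown in the log.

    # Wrap the transfer in a shell so the *.gz glob gets expanded, since neither
    # systemd nor rsync itself will expand it. As the command of a systemd
    # service (or a puppet systemd::timer::job), this would look something like:
    #   ExecStart=/bin/sh -c '/usr/bin/rsync -rt --perms --chmod=go+r --bwlimit=50000 /var/log/nginx/*.gz stat1007.eqiad.wmnet::dumps-webrequest/'
    /bin/sh -c '/usr/bin/rsync -rt --perms --chmod=go+r --bwlimit=50000 /var/log/nginx/*.gz stat1007.eqiad.wmnet::dumps-webrequest/'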
[15:44:52] !log installing anaconda-wmf_2020.02~wmf6_amd64.deb on all analytics cluster nodes. - T292699
[15:44:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:44:55] T292699: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699
[15:46:54] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (10Ottomata) Installing on all analytics cluster nodes: ` sudo cumin -b 10 -m async 'C:profile::analyt...
[16:11:30] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10BTullis) I've made several more steps forward on this, with sincere thanks to @elukey. Unfortunately we have now hit a serious blocker in terms of the Hive integration with At...
[16:13:15] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10Ottomata) Or...can we upgrade Hive as recommended?
[16:14:36] ottomata: o/ I think that hive 3 needs hadoop 3 :(
[16:14:54] so we should upgrade to bigtop 3.x
[16:16:28] yeah lets do it all
[16:16:30] everything 3
[16:16:31] spark 3
[16:16:40] hive 3
[16:16:41] hadoop 3
[16:16:43] elukey 3
[16:16:46] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10BTullis) OK, agreed that is another way forward. I will look into it. I had assumed that it would be a lot more work than we had bargained for, but maybe not.
[16:16:58] <3
[16:17:12] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10Ottomata) It might indeed be more than we bargained for. :/
[16:25:45] 10Data-Engineering, 10Project-Admins: Make EChetty Editor of Data-Catalog workboard - https://phabricator.wikimedia.org/T299541 (10odimitrijevic)
[16:26:55] 10Data-Engineering, 10Project-Admins: Create a workboard for Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10odimitrijevic) Thanks so much! Can you please also add @Echetty, our product manager as a trusted contributor?
[16:33:07] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10BTullis) I believe that this has now been fixed by a third-party. Attempting to confirm now with a new build.
[16:44:41] 10Analytics, 10Product-Analytics, 10Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (10kzimmerman) 05Open→03Declined Declining this as we don't have plans to continue this work in the foreseeable future.
[16:45:10] 10Analytics-Radar, 10Product-Analytics: Content for analytics.wikimedia.org - https://phabricator.wikimedia.org/T267254 (10kzimmerman) 05Open→03Declined Declining this as we don't have plans to work on it in the foreseeable future.
[16:45:13] 10Analytics, 10Product-Analytics, 10Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (10kzimmerman)
[16:46:49] 10Data-Engineering, 10Project-Admins: Make EChetty Editor of Data-Catalog workboard - https://phabricator.wikimedia.org/T299541 (10odimitrijevic) 05Open→03Invalid Closing as duplicate of T299357
[16:49:07] 10Analytics-Radar, 10Research-Backlog, 10Research-consulting: Report on Wikimedia's industry ranking - https://phabricator.wikimedia.org/T141117 (10kzimmerman)
[16:52:22] 10Data-Engineering, 10Project-Admins: Make EChetty Editor of Data-Catalog workboard - https://phabricator.wikimedia.org/T299541 (10Aklapper) (Feel free to {nav icon=anchor,name=Edit Related Tasks... > Close As Duplicate} in the upper right corner. Thanks!)
[16:52:36] 10Data-Engineering, 10Project-Admins: Create a workboard for Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10Aklapper)
[16:52:40] 10Data-Engineering, 10Project-Admins: Make EChetty Editor of Data-Catalog workboard - https://phabricator.wikimedia.org/T299541 (10Aklapper)
[16:53:06] 10Data-Engineering, 10Project-Admins: Create a workboard for Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10Aklapper) >>! In T299357#7633085, @odimitrijevic wrote: > Can you please also add @Echetty, our product manager as a trusted contributor? {{Done}}
[16:54:58] 10Data-Engineering, 10Project-Admins: Allow folks to create/edit workboard for #Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10Aklapper)
[16:55:06] 10Data-Engineering, 10Project-Admins: Allow folks to create/edit workboard for #Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10Aklapper) 05Open→03Resolved
[17:01:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (10BTullis) Confirmed with a fresh build and package.
[17:03:18] ping a-team - standup :)
[17:06:05] OOPS
[17:13:25] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10Eevans) By way of an update: Heap utilization since the 4th has remained stable. Our canary (1014-b...
[17:17:58] 10Data-Engineering, 10MediaWiki-General: Update pingback "PHP Version" dashboards - https://phabricator.wikimedia.org/T298922 (10nshahquinn-wmf) Apparently, #analytics is going to be archived; all the tasks that the team plans to work are now in #data-engineering.
[17:20:57] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10RobH)
[17:23:03] 10Data-Engineering, 10Product-Analytics: 22 small wikis missing from the mediawiki_history dataset - https://phabricator.wikimedia.org/T299548 (10nshahquinn-wmf)
[18:06:38] 10Data-Engineering, 10SRE, 10Traffic, 10Patch-For-Review: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401 (10phuedx) @JAllemandou: @elukey highlighted that we (Data Engineering and other stakeholders) should agree on the names for these he...
[18:11:02] 10Data-Engineering, 10Project-Admins: Allow folks to create/edit workboard for #Data-Catalog component - https://phabricator.wikimedia.org/T299357 (10odimitrijevic) Wonderful! Thank you so much!
[18:18:41] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10Product-Analytics: Wikistats reports no mobile unique devices for Wikidata and MediaWiki.org - https://phabricator.wikimedia.org/T299559 (10Aklapper)
[18:20:18] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog: Connect Atlas to a Data Source - https://phabricator.wikimedia.org/T298710 (10BTullis) As discussed in T296670#7633000 it looks like we have a serious blocker in connecting our test instance of Atlas to an existing Hive metastore. We would either...
[18:22:53] 10Data-Engineering, 10Data-Engineering-Kanban, 10User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (10BTullis) In case it helps, the UI for Atlas on the test cluster can be accessed by using an SSH tunnel like so: `ssh -NL 21000:an-test-coord1001.eqiad.wmnet:21000 an-test-coor...
[18:42:20] Heya razzi - I investigated Irene's dashboard-filtering problem with her, and it turns out there was no need for admin rights
[18:42:57] razzi: There is a hidden UI in edit-mode allowing you to specify which filters apply to which charts
[18:43:20] razzi: I think this UI is very new, and the documentation is not up-to-date
[18:43:30] razzi: I thought I'd let you know :)
[18:52:43] PROBLEM - Check unit status of eventlogging_to_druid_network_internal_flows_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_internal_flows_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:54:55] joal: ^
[18:55:15] mwarf :(
[18:58:33] that's weird - I get nothing in logs :(
[18:58:41] ok good to know joal re: dashboard filtering
[18:59:01] thanks for looking into that; I can remove the admin role from her user then?
[18:59:09] razzi: done already :)
[18:59:17] 👍
[19:13:52] 10Data-Engineering, 10Privacy Engineering, 10Research: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) 05Open→03Resolved a:03Isaac Long over-due resolving of task. Thanks again to all who were involved / supported! *...
[19:58:17] joal: will merge your job change and fix shortly...
[20:12:50] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (10Ottomata) Done.
[20:18:27] RECOVERY - Check unit status of eventlogging_to_druid_network_internal_flows_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_internal_flows_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:57:51] joal: merged
[20:58:01] and removed old jobs
[23:25:42] 10Data-Engineering-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10Ottomata) Okaayyyyyy! I've got a WIP branch going that does some nice stuff with Skein and Spark: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/7...
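The SSH tunnel command for the Atlas UI quoted at 18:22 is truncated in the log. The general shape of that kind of tunnel is sketched below; <ssh-target-host> is a placeholder, since the actual host at the end of the command is cut off above.

    # Forward local port 21000 to the Atlas UI on the test coordinator,
    # via whichever host you normally SSH through (<ssh-target-host> is hypothetical).
    ssh -N -L 21000:an-test-coord1001.eqiad.wmnet:21000 <ssh-target-host>
    # Then browse to http://localhost:21000/ while the tunnel is open.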