[03:44:47] elukey Thanks!
[06:03:32] :)
[10:48:43] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:10:35] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:14:05] hi a-team, event logging folks :P Anyone have any idea about the last 2 comments on https://phabricator.wikimedia.org/T286655 ?
[12:15:03] 10Analytics, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854 (10Gehel) 05Open→03Resolved
[12:18:56] 10Analytics, 10Discovery-Search (Current work): Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (10Gehel) 05Open→03Resolved
[12:55:34] addshore: since they haven't responded within a week, I'd go with no, it does not block
[12:55:59] lemme take a real quick look and give a +1
[12:56:01] or comment
[12:58:15] cool! ty!
[12:58:16] (03CR) 10Ottomata: [C: 03+1] "Haven't done a thorough review but looks ok to me! If you don't hear back from PDI folks I think you can merge." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj)
[13:21:20] Morning. I have a pcc compilation failure that I'm unable to explain. `./utils/pcc 706661 an-test-presto1001.eqiad.wmnet` or https://puppet-compiler.wmflabs.org/compiler1002/30339/an-test-presto1001.eqiad.wmnet/index.html - Any ideas anyone?
[13:22:01] It's mentioning a strange keytabs file, but I'm not sure why.
[13:22:03] yes i think so
[13:22:03] https://puppet-compiler.wmflabs.org/compiler1002/30339/an-test-presto1001.eqiad.wmnet/change.an-test-presto1001.eqiad.wmnet.err
[13:22:04] so
[13:22:21] there is a repo called 'labs/private' or something (will find link one sec)
[13:22:24] that isn't really private
[13:22:29] but is used for dummy secrets
[13:22:32] PCC runs in labs
[13:22:34] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/
[13:22:41] ty
[13:23:00] so, I betcha if you look in there, in I think
[13:23:10] secrets/secret/kerberos/keytabs
[13:23:12] or something like that
[13:23:16] you'll find other examples of dummy keytabs
[13:23:26] so you can just add the file it is looking for
[13:23:26] kerberos/keytabs/an-test-presto1001.eqiad.wmnet/alluxio/alluxio.keytab
[13:23:32] with dummy content like you see in the other files
[13:24:44] OK, thanks. Super-responsive, thanks :-)
[13:27:05] yep my fault sorry :)
[13:28:41] OK, so I can see that the alluxio keytab is mentioned in the coordinator.yaml `profile::kerberos::keytabs::keytabs_metadata` hash.
[13:29:49] But this node is still running green in production on https://puppetboard.wikimedia.org/node/an-test-presto1001.eqiad.wmnet
[13:29:49] Does this mean that it would fail the next time someone does a puppet-merge on a puppetmaster?
[13:30:45] hm, no, it just means PCC will fail
[13:30:53] PCC runs outside of production
[13:30:56] btullis: nono the real keytab is stored in the puppet private repo
[13:31:00] so doesn't have access to the secrets
[13:31:09] the labs_private basically mimics it for pcc
[13:31:35] elukey (hello!) moritzm btw, looking for review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/707564, would like to work on that today if possible! :)
[13:31:44] Oh I *see*. So if we add a real secret, we also have to add it to the fake secrets repo, otherwise pcc can't find it.
[13:32:19] ottomata: hello!
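The fix described above amounts to committing a placeholder file to labs/private. A minimal local sketch — the `modules/secret/secrets/...` prefix is my assumption based on the "secrets/secret/kerberos/keytabs or something like that" guess in the conversation, so verify it against the neighboring dummy keytabs in the repo:

```shell
# Sketch only: create the dummy keytab PCC is looking for, in a scratch
# directory standing in for a labs/private checkout. The repo-relative path
# is an assumption; check the other dummy keytabs for the real layout.
cd "$(mktemp -d)"   # stand-in for: cd ~/src/labs-private
keytab=modules/secret/secrets/kerberos/keytabs/an-test-presto1001.eqiad.wmnet/alluxio/alluxio.keytab
mkdir -p "$(dirname "$keytab")"
echo 'dummy keytab for PCC' > "$keytab"   # content is irrelevant; the file just has to exist
ls -l "$keytab"
```

After committing the placeholder (and self-merging it, as discussed later in the log), re-running `./utils/pcc` should no longer fail on the missing secret.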
Yes, I have it on my todo list, I already gave it a quick pass and afaics it seems good, the only question mark is about the home dirs
[13:32:29] I'll re-check it in max 30 mins
[13:32:36] btullis: exactly yes
[13:33:01] otherwise when puppet is evaluated in the context of pcc it will not find secrets and hiera private
[13:33:07] and it will fail like you showed earlier on
[13:33:19] elukey: <3
[13:41:47] ottomata: not today, but can have a look tomorrow
[14:17:49] ottomata: I think it is fine, as long as we mark the uid reserved :)
[14:17:50] elukey: re service users in puppet or data.yaml
[14:17:50] i think if we keep them in puppet, the classes will be more usable outside of places that might not include the admin module
[14:17:50] e.g. cloud vps
[14:17:50] so, ya, am proposing to keep system users like analytics-search in data.yaml, since those are really for use by real people declared in data.yaml
[14:17:50] but service/daemon users like hadoop and yarn in puppet.
[14:17:50] am fine with adding placeholder comments in data.yaml
[14:17:50] what do you think?
[14:18:42] stat1008 is again in need of a logs cleanup, I opened a task yesterday
[14:18:51] https://phabricator.wikimedia.org/T287339
[14:19:09] btullis: --^ if you want to have fun :)
[14:19:21] elukey: will do. Thanks.
[14:19:42] 10Analytics: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10BTullis) a:03BTullis
[14:20:12] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10Ottomata) Hi @ChristineDeKock, FYI I stopped your jupyter notebook server on stat1008. Disks were filling up again. I'm not sure why your process was causing log...
[14:20:16] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc..
to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10BTullis)
[14:20:16] oh elukey btullis FYI ^
[14:20:28] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10BTullis) p:05Triage→03High
[14:21:20] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10Ottomata) ` sudo mv syslog.1 /home/elukey/T287339/syslog.1.again `
[14:24:22] btullis: elukey i'm going to also copy messages and user.log again and truncate the existing ones
[14:25:15] ottomata: OK. Thanks.
[14:26:56] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10Ottomata) ` sudo -s cp syslog /home/elukey/T287339/syslog.again echo '' > syslog cp user.log /home/elukey/T287339/user.log echo '' > user.log `
[14:28:51] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10ChristineDeKock) Thanks; I'll keep everything shut down for now.
[14:29:06] Isn't `/home/elukey` also on the root partition though? We're not clearing any space by moving them here, right? Or have I missed an automount or something similar?
[14:29:31] btullis: /home symlinks to /srv/home
[14:29:45] Ah, thanks.
[14:30:09] basically we use this trick to allow rather big home dirs for experiments etc..
[14:30:40] we could have a more explicit mount / partition in theory, it would be clearer
[14:33:04] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10Ottomata) ` find /tmp -mtime +100 -delete ` But that actually didn't delete that much. There seem to be a lot of model training temp files owned by various user...
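The `cp` + `echo '' >` sequence pasted from the task above is the copy-then-truncate pattern: truncating the log in place keeps the inode that the syslog daemon already holds open, whereas `mv`-ing it away would leave the daemon writing to the rotated file. A self-contained sketch on scratch files (the paths are stand-ins, not the real /var/log):

```shell
# Demonstrate copy-then-truncate on scratch files. On a real host the writer
# (e.g. rsyslog) keeps its open file descriptor, so truncating in place frees
# the space without restarting it.
workdir="$(mktemp -d)"
printf 'noisy java stack trace\n' > "$workdir/syslog"
cp "$workdir/syslog" "$workdir/syslog.saved"  # keep a copy aside for inspection
: > "$workdir/syslog"                         # truncate in place; like echo '' > but without the stray newline
wc -c < "$workdir/syslog"                     # the live log is now empty
```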
[14:33:33] Is this on all servers, or a certain subset?
[14:33:53] btullis: only on stat100x
[14:34:14] 👍
[14:34:40] (03CR) 10Milimetric: "Last I remember we decided to write AirFlow jobs instead of updating the old oozie ones. Did we decide something else and I missed it?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/706605 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal)
[14:35:53] (03CR) 10Ottomata: "> Patch Set 4:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/706605 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal)
[14:46:34] I don't seem to have +2 rights on labs/private - Can I fix this myself?
[14:52:32] 10Analytics, 10Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (10BTullis) The vast majority of the `/home/elukey/T287339/syslog.again` and similar files appears to be Java stack traces, logged via bash. We have some errors fro...
[14:55:00] btullis: it is strange, ldap/ops have access https://gerrit.wikimedia.org/r/admin/repos/labs/private,access
[14:57:07] Oh, maybe it's missing CV1? No validation seems to have run.
[14:57:15] https://usercontent.irccloud-cdn.com/file/T7YbluNp/image.png
[14:59:35] btullis: can you hit "Reply" and then +2 +2 ?
[14:59:52] after that you should see submit at the top right corner
[15:00:23] Ah, sorry about that.
[15:00:26] Done.
[15:11:27] great :)
[15:11:47] don't worry, the first times are a bit confusing (no ci, self merge, self +2 +2, etc..)
[15:11:56] it is part of the magic :D
[15:27:36] ottomata: qq - am I needed for the tasking meeting??
[15:27:44] (we were wondering the same the other week)
[15:31:47] elukey: probably not needed for tasking, but it's always nice if you can come to the midweek sync
[15:31:57] You're always welcome though elukey !! :)
[15:32:01] sure!
thanks :)
[15:32:17] I don't want to avoid meetings with you folks, it is only that the ml backlog is big :D
[15:34:06] 10Analytics: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (10Ottomata)
[15:36:13] 10Analytics-Clusters: Upgrade Druid to 0.20.1 (latest upstream) - https://phabricator.wikimedia.org/T278056 (10Ottomata)
[15:37:30] 10Analytics-Clusters: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10Ottomata) a:03BTullis
[15:37:37] 10Analytics-Clusters: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10Ottomata) p:05Triage→03Medium
[15:40:03] 10Analytics-Clusters, 10Analytics-Kanban: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (10Ottomata) p:05High→03Low
[15:40:07] 10Analytics-Clusters, 10Analytics-Kanban: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (10Ottomata) a:05razzi→03None
[15:40:43] 10Analytics: Check home/HDFS leftovers of jkatz - https://phabricator.wikimedia.org/T287235 (10odimitrijevic) p:05Triage→03High
[15:41:12] 10Analytics, 10Event-Platform, 10Patch-For-Review: EchoMail and EchoInteraction Event Platform Migration - https://phabricator.wikimedia.org/T287210 (10odimitrijevic) p:05Triage→03High
[15:41:27] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: EchoMail and EchoInteraction Event Platform Migration - https://phabricator.wikimedia.org/T287210 (10odimitrijevic)
[15:43:43] 10Analytics: Purge gobblin files - https://phabricator.wikimedia.org/T287084 (10odimitrijevic) p:05Triage→03High
[15:43:47] 10Analytics-Clusters: Upgrade Druid to latest upstream (> 0.20.1) - https://phabricator.wikimedia.org/T278056 (10Ottomata)
[15:46:39] 10Analytics-Radar, 10Product-Analytics, 10Growth-Team (Current Sprint): Add geolocation information to Growth schemas -
https://phabricator.wikimedia.org/T287121 (10odimitrijevic)
[15:51:14] 10Analytics-Clusters, 10Infrastructure-Foundations, 10SRE, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10Ottomata)
[15:52:16] 10Analytics-EventLogging, 10Analytics-Radar, 10Platform Team Workboards (MW Expedition), 10Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (10odimitrijevic) Ping @Milimetric
[15:53:31] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Product-Data-Infrastructure: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (10odimitrijevic) p:05Triage→03High
[15:59:26] 10Analytics, 10Analytics-Kanban: Crunch and delete many old dumps logs - https://phabricator.wikimedia.org/T280678 (10odimitrijevic) p:05Medium→03High
[18:06:29] 10Analytics-EventLogging, 10Analytics-Radar, 10Platform Team Workboards (MW Expedition), 10Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (10Milimetric) I have only cursory knowledge of JsonSchemaContentHandler, but I can take a...
[18:27:21] 10Analytics: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (10JAnstee_WMF)
[18:40:56] 10Analytics: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (10Ottomata) From a purely structural perspective: if the datasets have differing schemas (columns in your spreadsheets case), then you'll likely want them to be different datasets (Hive tables) for sure. If they have the...
[19:01:13] 10Analytics, 10Analytics-Kanban, 10Platform Engineering, 10Research: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (10Ottomata) a:03Ottomata
[19:18:27] 10Analytics, 10Analytics-Kanban, 10Platform Engineering, 10Research: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (10Ottomata)
[19:35:07] a-team: has anyone followed up on https://hue.wikimedia.org/hue/jobbrowser/#!id=0032619-210701181527401-oozie-oozi-W? It's the webrequest bundle and ideally we get it done soon, it's triggering the rest of the SLAs. I can take care of it if razzi's not around
[19:36:31] milimetric: looking now
[19:36:37] https://hue.wikimedia.org/hue/jobbrowser/#!id=0032619-210701181527401-oozie-oozi-W failed
[19:36:38] haven't looked at it milimetric
[19:36:44] it looks like it failed due to data loss
[19:38:16] the data loss reported in the email was 1.28%, so barely over our threshold
[19:38:26] (but that's still quite a few events)
[19:38:52] jo-al and elukey deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/705621 last week to fix this I believe
[19:44:55] hm, that's true if these all turn out to be false positives. Also, if 5 minutes is not enough, and that problem re-occurs, we need to fix that some other way, longer delays aren't great.
[19:45:32] hm
[19:45:51] so perhaps I can just rerun this whole job and the sequence stats check would succeed this time?
[19:46:00] trying...
[19:46:03] ottomata: lemme do the false positive check first
[19:46:07] oh ok
[19:46:07] waiting
[19:46:09] ty
[19:47:40] (job launched, takes a sec)
[19:52:48] I get a bunch of these errors btw, in most spark jobs: WARN TransportChannelHandler: Exception in connection from /10.64.53.45:43158 followed by ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.64.53.45:43158 is closed
[19:53:59] ok, finished, it says everything's a false positive
[19:54:02] ottomata: I think it's safe to rerun
[19:55:19] ok
[19:55:37] running
[19:55:38] https://hue.wikimedia.org/hue/jobbrowser/#!id=0032867-210701181527401-oozie-oozi-W
[20:03:32] milimetric: it failed again
[20:03:51] at check sequence stats
[20:04:09] huh... weird
[20:04:46] oh! no, it's not weird, because it doesn't have a false positive thing in the job itself, it still thinks there's 1.28% loss
[20:05:00] hm, was there a way to force it...
[20:07:20] oh hm
[20:07:35] hm, well i guess we can just write the _SUCCESS flag manually?
[20:09:30] oh
[20:10:13] milimetric: if I just add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=07/day=26/hour=10/_SUCCESS
[20:10:25] the dependent jobs should be unstuck, right?
[20:10:48] hm... I guess you're right, ottomata, but I wonder why we don't have that documented
[20:11:29] oh, ottomata: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dealing_with_data_loss_alarms
[20:11:47] it says there how to rerun with a higher error threshold, it's an oozie property, you just have to launch it manually
[20:12:02] oh ok
[20:12:44] (I just edited to change coord1001 -> launcher1002
[20:12:46] )
[20:13:05] danke
[20:14:57] Thanks for looking into that milimetric et al
[20:19:06] np razzi, did you see the cassandra pageviews per article flat failure from this weekend?
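The two unblocking options discussed above, as a dry-run sketch. The commands are only echoed, and the oozie property name is a placeholder I made up for illustration — the real invocation is on the Dealing_with_data_loss_alarms wikitech page linked above — so treat this as a memo, not a recipe:

```shell
# Dry-run sketch of the two options above; nothing here touches the cluster.
hour_dir=/wmf/data/raw/webrequest/webrequest_text/year=2021/month=07/day=26/hour=10

# Option 1: write the _SUCCESS flag by hand so dependent jobs see the dataset as done.
success_cmd="hdfs dfs -touchz $hour_dir/_SUCCESS"
echo "$success_cmd"

# Option 2: relaunch the workflow manually (from an-launcher1002) with a raised
# data-loss error threshold; the property name below is illustrative, not verified.
rerun_cmd="oozie job --oozie \$OOZIE_URL -Derror_data_loss_threshold=2.0 -run"
echo "$rerun_cmd"
```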
[20:19:42] 0032882-210701181527401-oozie-oozi-C
[20:19:48] (subject line Fatal Error - Oozie Job cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25)
[20:19:52] I'm just starting to look at alert emails now, haven't seen that either
[20:20:00] https://hue.wikimedia.org/hue/jobbrowser/#!id=0032882-210701181527401-oozie-oozi-C
[20:20:05] (slowly working myself out of inbox bankruptcy)
[20:20:21] hey, been there! :) Lemme know if you need a second pair of eyes on anything
[20:21:00] thx ottomata! If you have to go, I'm working late so I can watch it
[20:21:16] milimetric: do you want to brain bounce on how to have a good alert emails / ops week process?
[20:21:33] razzi: sure, to the batcave! (in 30 sec I need some water
[20:21:34] )
[20:21:37] I keep missing parens!
[20:22:01] yeah give me a couple minutes
[20:22:42] Here I'll clean up any stragglers
[20:22:50] )]}>
[20:48:36] milimetric: looks like job succeeded
[20:48:37] ty
[20:49:14] yep, I sent the email, thank you!
[20:54:41] !log reran the failed workflow of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25
[20:54:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:15:03] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: EchoMail and EchoInteraction Event Platform Migration - https://phabricator.wikimedia.org/T287210 (10nettrom_WMF) >>! In T287210#7231273, @Ottomata wrote: > @nettrom_WMF @MMiller_WMF Do either EchoMail or EchoInteraction need client_ip...
[21:37:23] 10Analytics-Radar, 10MinervaNeue, 10Product-Analytics, 10Readers-Web-Backlog, 10Design: [Spike ??hrs] Sticky header instrumentation - https://phabricator.wikimedia.org/T199157 (10Jdlrobson) 05Stalled→03Declined Will cause confusion since this relates to mobile not desktop improvements.
[23:21:37] 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) Yes, https://gerrit.wikimedia.org/g/operations/puppet/+/4a3bf542618f4550dfbe450452ddc9e6294ed1d3/modules/profile/manifests/analytics/jupyterhub.pp#61 is the cron But I'm not sure migrating it...
[23:23:15] 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) Actually it's not sometimes, it's always missing. We've been getting this since the end of June at least, which is when I last cleaned out my root@ folder.
[23:34:21] 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) >>! In T286442#7238156, @Legoktm wrote: > But I'm not sure migrating it to a timer fixes the underlying issue, which is that sometimes(?) `/srv/home` is missing. Nvm, it would. Even though the...
[23:46:11] 10Analytics, 10SRE, 10Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) p:05Triage→03Low