[00:06:07] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) @bmansurov : right, you'd use the deploy server (`deploy1002`) to deploy into https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#rese...
[00:09:41] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (xcollazo) I can see that the changes are now live since the `research` instance now has its own cache folder: ` xcollazo@stat1007:~$ hdfs dfs -ls /wmf/cache/arti...
[01:26:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:47] PROBLEM - Check unit status of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:05:39] btullis: o/
[07:06:12] The info on the kubernetes/new wikipage may not be correct, the kubelets on dse-k8s-ctrl nodes are down due to problems with the labels
[07:07:31] --node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the specifically allowed set (beta.kubernetes.io/arch, beta.kubernetes.io/instance-type, beta.kubernetes.io/os, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone, failure-domain.kubernetes.io/region,
[07:07:36] failure-domain.kubernetes.io/zone, kubernetes.io/arch, kubernetes.io/hostname, kubernetes.io/instance-type, kubernetes.io/os)
[07:10:02] I have updated the tutorial
[07:29:41] there seems to be a problem on /mnt/nfs mounts on stat boxes, I believe due to a recent change made by John to have root:root as default for files/dirs in our
manifests
[07:29:58] not sure why but /mnt/nfs dirs are owned by uid 400
[07:30:01] weird
[07:30:06] maybe umount/remounting should fix?
[07:30:16] elukey: we've found that yesterday - I think btullis is on it
[07:30:33] I have no idea as to why the owner has been changed :(
[07:36:45] the change was https://gerrit.wikimedia.org/r/c/operations/puppet/+/809095/3/manifests/realm.pp
[07:37:02] maybe the id 400 is an old thing
[07:38:41] I think that it should work simply umounting and running puppet
[07:38:54] so that the dirs are chowned to root:root
[07:39:17] can I try on stat1004?
[07:39:29] elukey: stat1008 got rebooted yesterday, and the problem still occurred - expected?
[07:41:03] joal: in theory yes, because puppet runs after the /mnt/nfs mounts are added, and it fails
[07:41:07] (ro file system etc..)
[07:41:15] Ah!
[07:42:42] joal: ok if I try to umount + run puppet on 1004?
[07:43:16] elukey: I'm very ok for you to do everything you wish, but would mind waiting for btullis, so that we keep him involved?
[07:43:25] sure sure
[07:43:32] Thank you :)
[08:54:57] I'm around now. Sorry for the delay. Yes my investigation was leading toward the resource defaults as well, but I hadn't found the smoking gun yet.
[08:56:15] I tried an unmount on stat1004 as well. As soon as it is unmounted the ownership returns to root:root, remounted it becomes 400:400 again.
[09:02:29] :(
[09:03:10] I don't see anything specific in fstab that could point to that
[09:04:17] I think that's just normal NFS behaviour, it takes on the owner of the inode of the source filesystem.
[09:04:17] There are a couple of fixes suggested by andrewbogot.t here: https://phabricator.wikimedia.org/T317359#8222935
[09:06:00] elukey@clouddumps1001:~$ id -nu 400
[09:06:00] dumpsgen
[09:06:01] yeah
[09:06:16] the fix is probably ok, really sad that we have to do it
[09:06:18] sigh
[09:09:50] Yeah, neither option is exactly appealing. I'd probably rather set the uid in the file resource than use an exec.
At least the dumpsgen user is fixed at 400. https://phabricator.wikimedia.org/T317359#8222935
[09:10:11] Wrong link. This one: https://github.com/wikimedia/puppet/blob/production/modules/dumpsuser/manifests/init.pp#L6-L8
[09:14:05] Maybe I'll try requiring the dumpuser class on the stat boxes, so at least 400:400 will resolve to dumpsgen:dumpsgen
[09:21:38] It's happening on an-launcher1002 as well
[09:47:02] Data-Engineering, Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (bmansurov) Great!
[09:47:50] The fix for the NFS mounts has been applied. https://phabricator.wikimedia.org/T317359#8224237
[09:48:05] elukey: Thanks also for sorting out the k8s node labels. 👍
[09:53:38] btullis: very nice!
[09:55:59] elukey: Thanks <3. Would you like me to make the change to add knative to dse-k8s today? As per your comment here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/826836
[10:00:04] btullis: nono we can wait for knative, no rush :)
[10:02:09] OK, cool. Looks like there is currently an istio problem and an rsyslog problem to sort out on this cluster. So I'll be looking into those today.
[10:02:24] lemme know if you need help
[10:03:16] Thanks. I will.
[10:18:04] joal: found this today https://github.com/linkedin/feathr
[10:18:26] Ooooh :) Interesting!
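Note on the kubelet error quoted at 07:07: the rule it states can be sketched in Python roughly as below. This is a simplified reading of the error message only, not the kubelet's actual implementation. The function name `is_allowed_node_label` and the example label keys in the usage lines are hypothetical, and accepting subdomains of the two allowed prefixes is an assumption.

```python
# Allowed prefixes and the explicitly allowed label set, copied from the
# kubelet error message quoted in the log.
ALLOWED_PREFIXES = ("kubelet.kubernetes.io", "node.kubernetes.io")
ALLOWED_LABELS = frozenset({
    "beta.kubernetes.io/arch",
    "beta.kubernetes.io/instance-type",
    "beta.kubernetes.io/os",
    "failure-domain.beta.kubernetes.io/region",
    "failure-domain.beta.kubernetes.io/zone",
    "failure-domain.kubernetes.io/region",
    "failure-domain.kubernetes.io/zone",
    "kubernetes.io/arch",
    "kubernetes.io/hostname",
    "kubernetes.io/instance-type",
    "kubernetes.io/os",
})


def is_allowed_node_label(key: str) -> bool:
    """Return True if a --node-labels key passes the rule quoted in the log."""
    # The namespace is the part before the first "/", if there is one.
    namespace = key.split("/", 1)[0] if "/" in key else ""

    # The rule only restricts labels in the kubernetes.io namespace.
    restricted = namespace == "kubernetes.io" or namespace.endswith(".kubernetes.io")
    if not restricted:
        return True

    # Explicitly allowed label keys.
    if key in ALLOWED_LABELS:
        return True

    # Allowed prefixes (assumption: their subdomains are accepted as well).
    return any(
        namespace == prefix or namespace.endswith("." + prefix)
        for prefix in ALLOWED_PREFIXES
    )


if __name__ == "__main__":
    # Hypothetical label keys, for illustration only.
    for key in ("kubernetes.io/hostname", "node.kubernetes.io/pool",
                "node-role.kubernetes.io/master", "example.org/role"):
        print(key, is_allowed_node_label(key))
```

A check like this could run before rendering kubelet flags, so that a rejected label is caught at review time instead of when the kubelet fails to start.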
[11:08:14] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:08:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:37] Data-Engineering, Event-Platform Value Stream, Wikimedia-production-error: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 (Aklapper) +#eventgate (please add codebase project tags so such tasks show...
[11:23:52] (PS1) Vivian Rook: Set repo to readonly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/831070 (https://phabricator.wikimedia.org/T308978)
[11:24:15] (CR) CI reject: [V: -1] Set repo to readonly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/831070 (https://phabricator.wikimedia.org/T308978) (owner: Vivian Rook)
[11:52:09] Data-Engineering-Kanban, Data Engineering Planning, Event-Platform Value Stream: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (phuedx)
[12:19:35] (CR) Vivian Rook: [V: +2 C: +2] Set repo to readonly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/831070 (https://phabricator.wikimedia.org/T308978) (owner: Vivian Rook)
[12:22:54] (CR) CI reject: [V: -1] Set repo to readonly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/831070 (https://phabricator.wikimedia.org/T308978) (owner: Vivian Rook)
[12:24:32] Quarry, GitLab (Project Migration), Patch-For-Review: Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (rook) a:rook
[12:49:14] Quarry, GitLab (Project Migration), Patch-For-Review: Move Quarry from
Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (Aklapper)
[13:58:58] (PS2) Mforns: Add metric_id column to Wikidata EntitySchema text HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große)
[13:59:04] (CR) Mforns: [V: +2] Add metric_id column to Wikidata EntitySchema text HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große)
[14:17:51] Quarry: investigate blubber on github actions - https://phabricator.wikimedia.org/T317414 (rook)
[14:23:00] Quarry: investigate blubber on github actions - https://phabricator.wikimedia.org/T317414 (rook) https://github.com/toolforge/quarry/pull/5
[14:23:16] Quarry: test tox on PR - https://phabricator.wikimedia.org/T317092 (rook) Open→Resolved
[14:23:26] Quarry, GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (rook)
[15:34:02] Data-Engineering, Event-Platform Value Stream, Wikimedia-production-error: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 (cjming)
[15:37:13] Quarry: investigate blubber on github actions - https://phabricator.wikimedia.org/T317414 (rook) Open→Resolved
[15:37:46] Quarry, GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (rook) Open→Resolved
[15:48:33] Data-Engineering, Event-Platform Value Stream, Wikimedia-production-error: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 (cjming)
[15:55:08] Data-Engineering, Event-Platform Value Stream, Wikimedia-production-error: Eventgate error: '' should have required property 'database', '' should have required property 'performer' - https://phabricator.wikimedia.org/T317343 (cjming)
[16:01:12]
(CR) Joal: [C: +2] "LGTM - I must have been sleepy when I let this go through :S Sorry about that" [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große)
[16:02:03] Data-Engineering-Radar, Growth-Team, MediaWiki-extensions-GuidedTour, MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), MW-1.40-notes (1.40.0-wmf.1; 2022-09-12): Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (phuedx)
[16:03:10] Analytics-Kanban, Data-Engineering, Event-Platform Value Stream, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[16:03:24] Data-Engineering-Radar, Growth-Team, MediaWiki-extensions-GuidedTour, MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), MW-1.40-notes (1.40.0-wmf.1; 2022-09-12): Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (phuedx) Open→Resolved a:phued...
[16:44:57] Analytics, API Platform (Product Roadmap), Code-Health-Objective, Epic, and 3 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (BPirkle)
[22:42:01] Quarry: test irc integration - https://phabricator.wikimedia.org/T316961 (rook) Open→Resolved
[22:42:07] Quarry, GitLab (Project Migration): Move Quarry from Gerrit to GitHub - https://phabricator.wikimedia.org/T308978 (rook)