[00:11:09] (PS1) GoranSMilovanovic: T294985 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739377
[00:11:20] (CR) GoranSMilovanovic: [V: +2 C: +2] T294985 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739377 (owner: GoranSMilovanovic)
[00:49:01] (PS1) GoranSMilovanovic: T294984 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739380
[03:41:18] (DruidSegmentsUnavailable) firing: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:41:18] (DruidSegmentsUnavailable) firing: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:51:18] (DruidSegmentsUnavailable) resolved: More than 30 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[03:51:18] (DruidSegmentsUnavailable) resolved: More than 20 segments have been unavailable for webrequest_sampled_128 on the druid_analytics Druid cluster.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[07:28:18] !log `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user
[07:28:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:29:08] !log `apt-get clean` on an-tool1005 to free space in the root partition
[07:29:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:31:08] there are 4.5GB of space used in ottomata's home dir, mostly old venvs, we can follow up later on with Andrew about the clean-up (root partition usage around 86% now)
[07:41:42] Analytics, LDAP-Access-Requests, SRE: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (Peachey88) @CGlenn I would recommend filing separate requests for each team member using the template from here: https://phabricat...
[09:36:51] (CR) DCausse: [C: +2] Add performer field to sparql/query [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735445 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[09:37:49] (Merged) jenkins-bot: Add performer field to sparql/query [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735445 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[09:41:21] (PS1) Elukey: gobblin: use the new jks TLS bundle to validate certificates [analytics/refinery] - https://gerrit.wikimedia.org/r/739475 (https://phabricator.wikimedia.org/T291905)
[10:03:16] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) As @Ottomata mentioned there is a common schema for log messages called ECS (Elastic Common Schema), which the observability team...
[10:19:31] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The 10th snapshot has finished loading and compactions are now in progress from all 12...
[10:48:13] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) If we look at another host that is not in the list, but was purchased and installed at...
[10:53:32] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (BTullis) Open→Resolved Committed. The results are here: https://netbox.wikimedia.org/ext...
[11:06:07] Analytics, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (Volans) Thanks a lot!
[11:08:01] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (BTullis)
[11:10:58] Analytics, Data-Engineering, Data-Engineering-Kanban: Results have expired error in Hue - https://phabricator.wikimedia.org/T294144 (BTullis) Open→Resolved
[11:11:15] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (BTullis) Open→Resolved
[11:11:20] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (BTullis)
[11:28:27] I'm planning on restarting archiva soon, as part of https://phabricator.wikimedia.org/T295673
[11:28:27] Nothing special I need to know, is there? Just `systemctl restart archiva.service` and check that it comes back?
[11:28:44] exactly
[11:29:23] I usually double check archiva.wikimedia.org, and maybe one authentication
[11:31:02] elukey: Did you see that there is a question for us on https://phabricator.wikimedia.org/T295118#7509294
[11:31:02] I've not even been aware of furud.codfw.wmnet until just now, but I don't think we need to do anything about it, apart from maybe downtime it in Icinga. Would you agree?
[11:39:40] btullis: yes yes it is a misc host used for backups a long time ago, nothing to do!
[11:39:53] (downtime is fine, it is basically a client)
[11:40:58] elukey: Thanks.
It's not even a very good client at the moment :-)
[11:41:03] https://www.irccloud.com/pastebin/IbJmGXj8/
[11:41:21] ...but I'm not going to worry about that now.
[11:43:30] Andrew is the best poc for those nodes, not sure if we really need them or not
[11:43:36] (lunch :)
[11:44:43] !log btullis@archiva1002:~$ sudo systemctl restart archiva.service
[11:44:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:00:51] I also need to roll-restart the presto cluster. Any objection to my using the cookbook for it today? I'll also add some expanded notes for it to https://wikitech.wikimedia.org/wiki/Service_restarts and https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration#Roll_restart_the_Presto_cluster
[12:12:04] !log roll-restarting the presto analytics workers
[12:12:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:27:36] Analytics, LDAP-Access-Requests, SRE: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (Aklapper) Open→Invalid Basically what Peachey88 wrote - please split this task, and use the template link to fill out the...
[14:22:00] hello folks, qq - what do you think about kube-dse100[1-4] as names for the kubernetes worker nodes?
[14:22:08] didn't come up with anything fancier
[14:22:27] (and I have to tell to dcops today how we are calling them :D)
[14:23:57] elukey: That's fine by me. Nice and inclusive. What about the control plane nodes? Are there three of those too?
[14:25:07] Oh no, my mistake. It's 4 servers, not 7. (T286594) I thought that the control plane servers were different.
[14:26:29] (PS2) AKhatun: Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834)
[14:32:06] (CR) jerkins-bot: [V: -1] Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834) (owner: AKhatun)
[14:32:09] btullis: control plane will be on ganeti, I am thinking kube-dse-ctrl100[1,2], what do you thinl?
[14:32:13] *think
[14:32:39] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (JAnstee_WMF) Thanks again, @Ottomata!
[14:36:35] How about keeping the dse bit next to the numbers, like: kube-node-dse100[1-4] & kube-ctrl-dse100[1,2]
[14:39:57] Can you get away with only 2 control plane servers? I thought there had to be 3 for etcd quorum?
[14:41:42] so there will be other 3 for etcd nodes
[14:41:47] for ml-serve we have
[14:41:49] Or `dse-` as a prefix as per the `an-` prefix?
[14:42:00] - ml-serve100[1-4]
[14:42:21] - ml-serve-ctrl100[1,2] (k8s control plane daemons)
[14:42:31] - ml-etcd100[1-3] - etcd cluster
[14:42:39] the last two sets are on ganeti
[14:47:51] the current names for prod are https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers
[14:48:03] Ah, cool. Understood, makes perfect sense. So you have your team name `ml-`as the prefix, I'd stick with that:
[14:48:03] - dse-kube-node100[1-4]
[14:48:03] - dse-kube-ctrl100[1,2]
[14:48:03] - dse-kube-etcd100[1-3]
[14:48:03] ...but I don't have very strong feelings about it.
[14:48:37] ottomata: ---^ o/ :)
[14:49:00] dse-k8s is also a valid prefix
[14:49:27] we could probably avoid the -node part, but I am happy with whatever the quorum decides
[14:50:20] additional data point: we use -worker (and -control) for wmcs clusters
[14:50:35] ah nice!
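(Editor's note on the etcd quorum question at 14:39:57: a minimal sketch of the usual majority-quorum arithmetic, quorum = floor(n/2) + 1, which is the reason etcd clusters are normally sized at an odd number of members such as 3. The function names are illustrative and do not come from the log or from any etcd tooling.)

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority stays reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

(With 2 members the quorum is 2, so no failure is tolerated; with 3 members one failure is tolerated. That fits the layout described above, where the two control plane hosts are separate from a three-member etcd cluster.)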
[14:50:44] > I am happy with whatever the quorum decides
[14:50:44] Me too :-)
[14:51:15] let's wait for Andrew
[14:51:30] Yep.
[15:08:03] (PS3) AKhatun: Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834)
[15:09:10] Andrew is oiff today.
[15:09:16] ...off
[15:40:11] btullis: aaahhh okok :)
[15:40:25] so we can figure out naming
[15:43:44] hello, i did some tests with commons dumps ~14gb. i deleted them now. since its quite large, do i need to do something to also delete it from trash?
[15:51:07] I think it is fine tanny411, it will be cleaned up by hadoop and 14gb should be ok :)
[15:51:30] thanks for asking!
[15:51:40] great
[15:58:11] elukey: I like majavah's observation about using `-worker` and `-control` suffixes, in keeping with wmcs.
[15:58:11] I don't have strong feelings about the `-kube-` or `-k8s` infix. I like the `dse-` prefix.
[16:01:41] !log roll-restarting kafka-test brokers
[16:01:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:04:49] ok we can go for dse-k8s-worker in this case
[16:05:10] quick note - I had a chat with serviceops, and I had some misconceptions about the pod ips
[16:05:42] we assign subnets for pods that are not associated with any private/analytics/etc.. vlan
[16:05:53] and we use calico+BGP to make everything working
[16:06:37] so given that the pod subnets don't need to be in the analytics vlan, the underlying worker nodes can be in the private vlan
[16:06:59] and we'll need to add the pod subnets (when configured) in the ferm rules to allow that traffic part (for the services that we'll need)
[16:08:32] Ah, great. Yes I remember hearing about that calico/BGP mechanism in a.kosiaris' talk recently.
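(Editor's note on the host naming shorthand used throughout this log, e.g. stat100[5,8], ml-serve-ctrl100[1,2], ml-etcd100[1-3]: a small sketch of how such a pattern expands into individual hostnames. `expand_hosts` is a hypothetical helper written for illustration, not an existing tool. It handles only one bracket group, does not preserve zero padding in ranges, and the worker pattern in the usage note assumes the `dse-k8s-worker` prefix agreed at 16:04:49 together with the four servers mentioned earlier.)

```python
import re
from typing import List

# Matches the first bracket group, e.g. "[1-4]" or "[5,8]".
_BRACKET = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket group such as 'name100[1-4]' or 'name100[1,2]'."""
    match = _BRACKET.search(pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]
    prefix = pattern[: match.start()]
    suffix = pattern[match.end():]
    hosts: List[str] = []
    for part in match.group(1).split(","):
        if "-" in part:
            # Inclusive numeric range, e.g. "1-4" -> 1, 2, 3, 4.
            start, end = part.split("-", 1)
            values = [str(i) for i in range(int(start), int(end) + 1)]
        else:
            values = [part]
        hosts.extend(prefix + value + suffix for value in values)
    return hosts


if __name__ == "__main__":
    print(expand_hosts("dse-k8s-worker100[1-4]"))
    print(expand_hosts("stat100[5,8]"))
```

(Under these assumptions the first call yields dse-k8s-worker1001 through dse-k8s-worker1004, and the second yields stat1005 and stat1008.)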
[16:15:55] > we'll need to add the pod subnets (when configured) in the ferm rules to allow that traffic part (for the services that we'll need)
[16:15:55] What about the homer rules? We'll still need these, won't we? Or is it just *out* of the analytics vlan that needs to be configured in homer?
[16:22:31] only traffic going from analytics to production
[16:23:08] in theory, as starting point, we should have only use cases of k8s -> analytics services/nodes
[16:23:45] but in case we need to contact k8s services from analytics we'll need to add homer rules
[16:24:05] does it make sense? (this is how I view the current requirements)
[16:26:11] Yes, that makes perfect sense. Thanks.
[17:11:19] Analytics, Data-Engineering, Data-Engineering-Kanban: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (JAnstee_WMF) [[ https://docs.google.com/document/d/1ESDYV3SwGNQERXfKFbHPBkGmojcy_vbNAc3WKMuvCKo/edit# | Here are the meeting notes ]] from our sync yesterday. **Actio...
[17:29:26] elukey: Shall we skip SRE sync?
[17:31:33] btullis: sorry I am still in an interview (finishing), if you don't have anything to discuss we can skip
[17:31:41] (I'd join in 5 mins max)
[17:31:54] We're geeking out on data catalogs, so that's fine by me. +1
[17:32:04] perfect, let's skip :)
[17:32:26] ack. thanks.
[19:41:39] (CR) Joal: "Comments about comments, and a suggestion to prevent copy/pasting - Thanks a lot for the patch Aisha :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834) (owner: AKhatun)
[22:48:11] (PS1) Clare Ming: Update web_ui_scroll schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246)
[22:50:37] (PS1) Clare Ming: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739661 (https://phabricator.wikimedia.org/T294777)
[22:54:25] (Abandoned) Clare Ming: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739661 (https://phabricator.wikimedia.org/T294777) (owner: Clare Ming)
[22:56:19] (PS1) Clare Ming: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739666 (https://phabricator.wikimedia.org/T294777)
[23:01:57] (CR) Nray: [C: +2] Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739666 (https://phabricator.wikimedia.org/T294777) (owner: Clare Ming)
[23:02:55] (Merged) jenkins-bot: Update web_ui_reading_depth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739666 (https://phabricator.wikimedia.org/T294777) (owner: Clare Ming)
[23:08:24] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 3 others: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (cjming) a:cjming→None