[00:11:17] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:15:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:44] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:17] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:05:41] 10Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (10Stevemunene)
[06:07:31] !log pool druid1008 after reimage T332589
[06:07:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:07:34] T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589
[07:15:05] Hi everyone. What could be the reason for the absence of data from China or Russia on this page? https://stats.wikimedia.org/#/ru.wikipedia.org/reading/page-views-by-country
[07:15:06] Sorry if you've already answered this question or this is a known bug.
[08:11:17] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:13:24] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.289% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:06:48] newbee: see the country protection list https://wikitech.wikimedia.org/wiki/Country_protection_list
[09:14:12] Oh, got it. Thanks a ton for the help
[10:09:01] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10ayounsi) This issue has proven problematic again today while troubleshooting a paging issue.
[10:09:50] we started receiving many emails about missing SLAs for airflow tasks starting this morning at 7 AM
[10:10:20] joal is that on your radar? Anything we can help with?
[10:11:25] brouberol: I suspect (although I haven't looked into it much) that it's another case of this bug striking. T351909
[10:11:26] T351909: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909
[10:14:03] There's an email about it here: https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/message/6TPJDDKTOBVAQACXYNATXBTPCNN2J4FP/ but it's not easy to see the signal in all the noise on this list.
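As an aside on T351909 above: the failure mode of duplicate keys in a delimited header can be illustrated with a toy parser. This is a hypothetical sketch, not the actual refine code; it assumes an x_analytics-style header of semicolon-delimited `key=value` pairs and shows how a naive map parse silently collapses a duplicated key.

```python
# Toy illustration (not the actual refine code) of how duplicate keys
# in a semicolon-delimited k=v header can silently corrupt a parsed map.
def parse_header(header: str) -> dict:
    """Parse a header like 'ns=0;page_id=42' into a dict."""
    pairs = [kv.split("=", 1) for kv in header.split(";") if kv]
    return dict(pairs)  # a duplicate key silently overwrites the earlier value

ok = parse_header("ns=0;page_id=42")
dup = parse_header("ns=0;ns=1;page_id=42")  # duplicate 'ns' key
print(ok)   # {'ns': '0', 'page_id': '42'}
print(dup)  # {'ns': '1', 'page_id': '42'} - the first 'ns' value is lost
```

A stricter parser (or a Hive map type) may instead reject the row outright, which matches the "break refinement" symptom described in the task title.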
[10:16:01] hello folks
[10:16:17] https://phabricator.wikimedia.org/T351731 needs some attention, we are missing some important metrics for DDoS detection
[10:16:18] Morning Luca :-)
[10:16:21] morning :)
[10:16:24] not sure what happened
[10:17:01] Agreed.
[10:17:15] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10elukey) Example of the issue: https://w.wiki/8Fo7 The Region is null for the recent data (likely the data indexed in real time directly from Kafka to Druid).
[10:22:57] 10Data-Platform-SRE: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (10BTullis)
[10:23:51] lol now it works
[10:23:55] after a turnilo restart
[10:24:08] maybe it was needed due to the recent druid host changes?
[10:25:05] Doh! Sorry I didn't think of this when it was originally reported. I'd assumed it was an issue with the indexing and would need a deep dive.
[10:25:31] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10elukey) After a restart of Turnilo I see the datacube back to its original state. Maybe Turnilo needed a restart after the an-druid host moves?
[10:25:46] btullis: yeah me too, but turning it off and on worked again :D
[10:26:47] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10elukey) 05Open→03Resolved a:03elukey We can probably close, let's reopen in case it re-occurs.
[10:32:37] 10Data-Platform-SRE: Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis)
[10:39:35] 10Data-Platform-SRE: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (10BTullis)
[10:40:39] 10Data-Platform-SRE: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10BTullis)
[12:04:43] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10JAllemandou) Thanks a lot @elukey for quickly troubleshooting <3
[12:05:02] Good morning brouberol - sorry I missed your earlier ping
[12:05:14] we're aware of the issue, btullis was right in his analysis
[12:14:50] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.396% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:34:17] ack thanks
[12:34:57] !log Rerun webrequest refine text for 2023-11-23T17
[12:34:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:51:36] btullis I added documentation about how to use the datahub CLI https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub#Via_the_CLI
[12:51:58] with the export trick you mentioned yesterday to validate the x509 certificate chain
[13:19:52] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10brouberol) ` root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e dse-k8s-eqiad -i apply helmfile.yaml: basePath=. skipping missing values file matching "calico/val...
[13:20:03] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10brouberol) 05Open→03Resolved
[13:20:06] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10brouberol)
[13:20:08] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[13:20:26] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10brouberol) ` brouberol@deploy2002:~$ kubectl get namespaces | grep spark-history spark-history Active 54s spark-history-test Active 54s `
[13:20:59] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[13:21:33] stevemunene btullis could I have eyes in this small CR please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/976733/
[13:23:17] s/in/on
[13:28:20] 10Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (10brouberol) We can probably take inspiration from https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server as well
[13:31:00] Looks good, just re-running pcc to check that the puppet 7 failure was transient.
[13:32:06] Nice work on the datahub docs too. :+1
[13:33:16] I also don't quite get the puppet 7 failure
[13:36:54] I've seen a couple of puppet 7 related issues in the last couple of days. It felt like an unrelated issue affecting the CI at the time
[13:39:06] pcc failed again on puppet7 for the same host, very odd. The change looks good anyway, but I'd be tempted to wait until Monday to deploy.
[13:40:14] sure, no worries
[13:41:31] I have a question related to kerberos keytabs for kubernetes services in https://phabricator.wikimedia.org/T351816, if any of you knows kerberos better than me (aka, at all)
[13:43:16] Ah yes, I did see that question :-) I hadn't really started thinking about it yet, but we will need to.
[13:43:55] the general idea would be to generate a keytab for that service (with an agreement on the principal name), and add it base64-encoded to the puppet secret hieradata, so we could then pull it into our chart and `{{ $Values.secrets.keytab | b64dec }}` it in helm
[13:44:09] no rush at all, it's just so it does not get lost
[13:49:39] Yes, that's exactly the approach I was thinking about too. It occurs to me that we will probably also want to mount `/etc/krb5.conf` into the pod.
[13:50:11] There are no secrets in there though, just default settings.
[13:53:52] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10brouberol) Note for later > It occurs to me that we will probably also want to mount `/etc/krb5.conf` into the pod as well.
[13:59:49] brouberol: there are some checks, IIUC, that the fqdn used in the keytab is indeed where the requests come from
[14:01:10] people created stuff like https://engineering.linkedin.com/blog/2020/open-sourcing-kube2hadoop to work around some issues, essentially having the hadoop tokens distributed to pods
[14:01:19] (bypassing kerberos principals etc..)
[14:02:07] (not saying it shouldn't work, but to check how the keytab is validated once issued)
[14:07:04] oh so the kerberos server tries to resolve the principal and checks that it resolves to the IP of the requester?
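The encode/decode round trip described at 13:43:55 can be sketched in shell. This is a hypothetical sketch: the keytab contents and file paths are stand-ins, and the actual keytab would come from kadmin on the KDC host rather than `printf`.

```shell
# Stand-in for a real keytab produced on the KDC (e.g. via kadmin ktadd);
# the path and contents here are placeholders for illustration only.
printf 'fake-keytab-bytes' > /tmp/demo.keytab

# Base64-encode it for the puppet secret hieradata.
# -w0 disables line wrapping so the value stays a single hiera string.
encoded=$(base64 -w0 /tmp/demo.keytab)
echo "$encoded"

# At chart render time, helm decodes it back to the original bytes,
# e.g. {{ .Values.secrets.keytab | b64dec }} inside a Secret template.
# The equivalent shell round trip:
echo "$encoded" | base64 -d
```

Keytabs are binary, which is why the hieradata value has to be base64 rather than the raw file contents.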
[14:09:00] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10BTullis) If you look at the output of the following command on `krb1001` you can see the whole list of our existing principa...
[14:10:20] elukey: Oh, I didn't know about these checks for the fqdn in the keytab.
[14:12:38] brouberol, btullis - yes there are checks like those, this is why for example we need to create keytabs specific to hosts. I don't recall where/when those checks happen, since it has been a while :D, but there should be documentation about it..
[14:13:18] it should be a matter of checking that the fqdn resolves to the IP from which the auth request comes
[14:21:00] Thanks, that’s super helpful!
[14:22:16] I have found this info that mentions the fqdn requirement: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Kerberos_principals_for_Hadoop_Daemons
[14:23:55] ...but that seems to be specific to when using kerberos to secure the daemons themselves. In our case the spark history server is just an HDFS client, so I'm wondering whether it is the same.
[14:26:53] so this may be stale knowledge, but IIRC when the krb client issues a request for a TGT it has to contact the krb control plane, providing the credentials
[14:27:13] and at that point, the checks are performed
[14:27:28] (like what happens on stat100x when analytics-privatedata authenticates)
[14:27:37] then /tmp/krb_something (IIRC) gets populated
[14:57:27] OK, thanks elukey. This is all good stuff. I found this as well: https://web.mit.edu/kerberos/krb5-latest/doc/admin/princ_dns.html#overriding-application-behavior
[15:00:21] <3
[15:40:09] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10BTullis) We've been having some fruitful discussion with @elukey about this and it seems that it may be a little trickier to...
[15:42:11] 10Data-Engineering, 10Data Products: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (10mforns) > @mforns what kind of information would you need to help troubleshooting? Would knowing that a partition...
[15:43:14] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10BTullis) 05Open→03Declined Declining as per discussion. I have reset the default value of `max_active_runs_per_dag` back to 1 as suggested.
[16:10:09] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) 05Open→03Resolved
[16:11:30] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10BTullis) 05Open→03Resolved I've finished this removal and I've had a good crack at archiving/updating related oozie docs in Wikitech.
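On the point at 14:23:55 that the history server is just an HDFS client: Spark ships standard options for letting the history server authenticate to a kerberized HDFS with its own principal and keytab. A sketch of the relevant `spark-defaults.conf` fragment; the principal name, realm, and keytab path here are hypothetical placeholders, not the actual WMF values:

```
# Sketch only - principal, realm, and path are illustrative placeholders.
spark.history.kerberos.enabled    true
spark.history.kerberos.principal  spark/spark-history.example.wmnet@EXAMPLE.REALM
spark.history.kerberos.keytab     /etc/spark/spark-history.keytab
```

With these set, the history server performs its own login from the keytab, rather than relying on a pre-populated credential cache like the `/tmp/krb_something` one mentioned above for interactive hosts.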
[16:14:51] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.396% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:00:14] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10mpopov) Thanks, Ben! I'd like to expand the section Joseph added. (Thank you, Joseph!) I have some questions/suggestions in https://wikitech.wikimedia.or...
[20:18:24] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.397% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace