[00:35:14] 10Data-Engineering, 10Data-Engineering-Jupyter: support Python >=3.11 on conda-analytics - https://phabricator.wikimedia.org/T346673 (10nshahquinn-wmf) [01:17:36] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [03:00:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [04:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:36] (SystemdUnitFailed) resolved: monitor_refine_event_sanitized_main_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:30] 10Data-Platform-SRE: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) From the previous tickets, the steps are roughly [] Create Keytabs [] Add the hosts to the `role(druid::public::worker)` [] druid1009 [] druid1010 [] druid1011 [05:48:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:38:47] FYI, I'll be rebooting the systems behind hue/yarn/superset/turnilo in the next ~ half hour, they may individually be unavailable for ~ a minute, will send a note when they're all done [07:00:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [07:00:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [07:12:19] these are all complete now [07:16:35] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10awight) @BTullis @Gehel Let us know how we can help with the decision about where to publish this data. Everything is ready on our side. 
[07:17:12] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on an-tool1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:22] * brouberol waves good morning [07:20:58] (SystemdUnitFailed) resolved: (2) ifup@ens13.service Failed on an-tool1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:38:55] Oscar is sick and staying home, my availability will be reduced today. [08:19:14] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Gehel) 05Open→03Resolved [08:19:17] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10Gehel) [08:19:47] moritzm: Many thanks for handling those reboots [08:21:01] 10Data-Platform-SRE, 10decommission-hardware: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Gehel) Still need to delete keytabs from private puppet repo. [08:23:53] yw :-) [08:26:18] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (10BTullis) [08:29:09] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10BTullis) Should this be a child of {T345698} or should they be merged, I wonder? 
[08:37:58] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10BTullis) Thanks @Milimetric [09:00:02] brouberol: We have a kafka-under-replicated partitions alert at the moment. Is this expected? https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DKafkaUnderReplicatedPartitions [09:00:36] not that I know of. I haven't run any operation since yesterday afternoon [09:01:03] OK, cool. Let's investigate :-) [09:04:08] I'm seeing `the leader reported an error: CLUSTER_AUTHORIZATION_FAILED` errors on the newly provisioned brokers [09:04:27] and they were assigned partitions randomly since yesterday [09:04:33] kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --under-replicated-partitions --describe [09:04:33] Topic: codfw.mediawiki.job.ArticleChangedJob Partition: 0 Leader: 1003 Replicas: 1003,1014,1007 Isr: 1003,1007 [09:04:33] Topic: codfw.mw_page_content_change_enrich.error Partition: 0 Leader: 1004 Replicas: 1004,1009,1012 Isr: 1004,1009 [09:10:05] gmodena: Hey! There is a MediawikiPageContentChangeEnrichJobManagerNotRunning alert. Would you know what that is and what to do about it? It seems stuck since 15:40 UTC yesterday [09:10:14] do we have to somehow set ACLs for newly provisioned brokers? cc elukey [09:10:32] gehel checking [09:10:55] that's probably a task for opsweek, so ping jennifer_ebe (and joal) [09:13:14] gehel might be DC switchover related? [09:13:35] I see we now consume/produce events from codfw https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich&from=now-24h&to=now [09:14:22] might be. I'm not sure what the data flows are on this job. 
Do we only consume from the local DC? Or do we consume the replicated topics? [09:14:46] gehel they consume from the local dc [09:14:53] The alert name makes it sound like the job itself is not running, but it might actually just mean that there is no data. [09:15:17] and that job is running on wikikube, not on the dse k8s cluster? [09:15:25] nope, the application is actually down in eqiad [09:15:42] it runs on wikikube [09:16:06] ok, so it might be related to the switch, but there is really an issue [09:16:14] how critical is this? [09:17:16] gehel not critical, there is no (apparent) data loss. The job started consuming and producing from/to codfw [09:17:33] but I'm looking into why the application failed [09:18:21] ok, I'll send an update in #wikimedia-sre and let you follow up on it. Keep me posted if you find anything, or ping btullis if you need SRE support! [09:18:50] gehel btullis ack. I followed up on the alert mail [09:18:58] gehel thanks for the ping! [09:19:04] my pleasure! [09:19:08] I have a sense that the old kafka brokers haven't been restarted to take the new ones into account in their configuration [09:24:12] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) Thanks a lot for digging @xcollazo! >>! In T343844#9179732, @xcollazo wrote: > Notice how the `--archives` flag points... [09:35:34] gmodena are you familiar with how we have set up kafka acls? It seems that our new brokers can't fetch assigned partitions, due to kafka.common.errors.ClusterAuthorizationException errors [09:36:17] given that they have the exact same role as other brokers, I'm guessing we have to enable them into the cluster somehow? [09:36:18] brouberol i'm not, sorry. 
AFAIK it's one of the bits we tweak out of puppet [09:36:32] possibly elukey knows more [09:36:39] np thanks [09:37:17] brouberol: There is some documentation here too, in case it helps. https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs [09:38:47] brouberol: no in theory we don't need to set extra ACLs [09:39:21] then I have no clue as to why the new brokers are not authorized to fetch partitions :/ [09:39:48] btullis: I have read that, but I haven't found anything related to broker-specific ACL [09:40:19] brouberol: CN = Puppet CA: palladium.eqiad.wmnet [09:40:30] did you add a broker with a cergen certificate? [09:40:57] we have PKI to handle certs, they should be automatically provisioned [09:41:15] ah no wait those are accepted CAs, my bad [09:41:17] not that I know of, but I wasn't the one who installed the hosts themselves. I mainly added them to puppet [09:41:30] the cert is good on 1015 sorry, pebcak [09:41:37] ack [09:41:59] where do you see the errors? (so I can check) [09:42:13] if that helps, this is what I see on a partition leader when an un-auth-ed broker tries to fetch the partition: org.apache.kafka.common.errors.ClusterAuthorizationException: Request Request(processor=4, connectionId=10.64.16.99:9093-10.64.135.16:41834-2157, session=Session(User:CN=kafka-jumbo1014.eqiad.wmnet,/10.64.135.16), [09:42:13] listenerName=ListenerName(SSL), securityProtocol=SSL, buffer=null) is not authorized. [09:42:23] gehel jennifer_ebe looks like the mw-page-content-change-enrich app in eqiad is failing to startup because it can't reach swift https://logstash.wikimedia.org/goto/ce1765e186329ed74f179d375f8df182. We need swift for HA, so the k8s operator throws in the towel at boot. [09:42:39] ^ on kafka-jumbo1003.eqiad.wmnet for this specific error [09:43:17] and how many brokers did we add? 
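Aside on the `kafka-topics ... --under-replicated-partitions --describe` output pasted at 09:04:33: the brokers that cannot catch up are those listed under `Replicas` but missing from `Isr`. A minimal Python sketch of that comparison follows. It assumes the output format is exactly the one quoted in this log; the function name `lagging_replicas` and the `SAMPLE` constant are hypothetical and not part of any Kafka tooling.

```python
import re

# Matches one partition line of `kafka-topics --describe` output, assuming
# the layout quoted in the log (Topic / Partition / Leader / Replicas / Isr).
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def lagging_replicas(describe_output):
    """Map (topic, partition) to the broker ids in Replicas but not in Isr."""
    result = {}
    for match in LINE_RE.finditer(describe_output):
        replicas = [int(b) for b in match["replicas"].split(",") if b]
        isr = {int(b) for b in match["isr"].split(",") if b}
        missing = [b for b in replicas if b not in isr]
        if missing:
            result[(match["topic"], int(match["partition"]))] = missing
    return result


# The two partition lines pasted in the channel at 09:04:33.
SAMPLE = (
    "Topic: codfw.mediawiki.job.ArticleChangedJob Partition: 0 Leader: 1003 "
    "Replicas: 1003,1014,1007 Isr: 1003,1007\n"
    "Topic: codfw.mw_page_content_change_enrich.error Partition: 0 Leader: 1004 "
    "Replicas: 1004,1009,1012 Isr: 1004,1009\n"
)

if __name__ == "__main__":
    for (topic, partition), brokers in lagging_replicas(SAMPLE).items():
        print(topic, partition, brokers)
```

On the pasted lines this singles out brokers 1014 and 1012, both from the newly added range, which fits the authorization errors being discussed.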
[09:43:41] we added 1010->1015 [09:43:50] (1015 included) [09:43:59] gehel jennifer_ebe I'll f/up with data persistence [09:44:01] hello gmodena, what can i do to fix? [09:44:18] okay thank you gmodena [09:44:54] it turns out that 2 newly created topics had their partitions randomly assigned to brokers 1012 and 1014, but they aren't authed to fetch the partition data. I'm assuming that'd be the case for all 1010->1015 brokers as well [09:45:08] brouberol: we use `super.users` in the kafka config to state which brokers can communicate with the other ones, I am wondering if we have restarted the brokers after it got changed [09:45:40] right, that was my hypothesis as well: [09:45:40] > I have a sense that the old kafka brokers haven't been restarted to take the new ones into account in their configuration [09:45:55] yeah I think it is the likely culprit [09:45:58] so we could try to restart kafka-jumbo1003 and see if that works [09:46:09] and if it does, perform a cluster RR [09:46:13] +1 exactly, please downtime it via cookbook (for, say, 10 mins) [09:46:20] otherwise it pages if anything goes wrong [09:46:32] on it [09:47:01] 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, 10WMDE-FUN-Sprint-2023-09-04: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10kai.nissen) @Mayakp.wiki Our current (and future) live banners don't call `hi... [09:48:48] downtime set. I'll restart the kafka service on kafka-jumbo1003 [09:50:51] done. 1003 is back in full sync [09:51:08] \o/ [09:53:37] (btullis: not so fast, 1003 is the one that I manually restarted. The other 2 brokers are still denied from fetching data. The issue is that the partition leader used to be 1003, we restarted it in the hope that it would then allow the broker 1012 to fetch data, as it'd have read its new config, but leadership has been transferred to a new replica [09:53:37] :D) [09:54:50] however, good news. 
1003 is back to being the leader, and 1014 was now authorized to fetch data! Meaning we need to perform a full cluster rolling-restart [09:55:11] thanks elukey for confirming my un-educated hypothesis [09:55:38] np! [09:56:05] Yeah, I was all \o/ for the working out stuff. :-) [09:56:14] btullis: is there any particular precaution we need to take before running the sre.kafka.roll-restart-brokers cookbook? [09:56:51] (speaking of, https://gerrit.wikimedia.org/r/c/operations/puppet/+/959162 should make cluster rolling restarts safer and/or faster) [09:59:03] (still a WIP, but just FYI) [09:59:55] Nothing special. There is a nice message telling you to check Grafana for 'expectedness ' (not a real word) of things before typing go, but I think that's it. [10:00:46] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [10:03:10] (03CR) 10Phuedx: Remove unused schemas (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/755724 (owner: 10Awight) [10:05:06] alrighty, the rolling-restart is ongoing. It should take a good part of the day, as each step takes about 20 minutes, and we have 15 of them [10:08:02] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) Thanks @Stevemunene - The plan looks good, but we also need to take into account the zookeeper cluster which is co-located with this Druid cluster. We are refreshing druid100[4-6]... [10:12:44] gehel jennifer_ebe found the root cause. Config issue on the app side that messed up cross DC network calls to swift. Patch incoming [10:15:51] okay thank you! 
gmodena [10:18:57] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10gmodena) [10:30:25] (03PS2) 10Phuedx: Remove unused schemas [analytics/refinery] - 10https://gerrit.wikimedia.org/r/755724 (owner: 10Awight) [10:32:59] (03CR) 10Phuedx: "I've been **bold** and rebased/updated this per the results of my script." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/755724 (owner: 10Awight) [10:42:32] (03PS3) 10Phuedx: Remove unused schemas [analytics/refinery] - 10https://gerrit.wikimedia.org/r/755724 (owner: 10Awight) [11:00:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [11:00:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [11:02:04] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) > I think puppet will set up the load-balancing automatically (by virtual of the `profile::lvs::realserver` being applioed) but it may be necessary to notify `pybal` of the ch... [11:05:18] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10gmodena) [11:07:29] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10BTullis) Hi @awight - Apologies for the delay in getting back to you about this. 
I've checked and everything is fine for you to proceed from our side and your unders... [11:11:41] 10Data-Engineering, 10Product-Analytics: Need data insight on the Hindi Wikipedia and Wikisource Edit-a-thon - https://phabricator.wikimedia.org/T345655 (10KCVelaga_WMF) [11:25:00] btullis: the rolling restart is still ongoing, but the under replicated partition alert has resolved [11:25:27] however, I'm seeing spikes of offline partitions around the restarts, meaning we possibly have partitions with a replication factor of 1 [11:27:16] brouberol: Great. Can we get a list of these topics with replication factor 1, if there are any? There's probably quite a bit of cruft because we auto-create. [11:32:41] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) >>! In T336042#9182474, @Stevemunene wrote: > >> I think puppet will set up the load-balancing automatically (by virtual of the `profile::lvs::realserver` being applioed) but it m... [11:32:56] yep, I'll have a look at that. I'll create a ticket to look for those and possibly even monitor them [11:35:02] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) [11:37:49] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) ` brouberol@kafka-jumbo1014:~$ kafka topics --describe | grep 'ReplicationFactor:1' Topic:USER_REVISION_CREATES_PER_DOMAIN_PER_SESSION_OTTO1 PartitionCount:4 ReplicationFactor:1 Conf... [11:37:53] btullis here are the topics: https://phabricator.wikimedia.org/T346887 mostly test ones + ksql [11:55:28] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ... 
[11:55:28] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [11:55:49] gehel jennifer_ebe FYI: mw-page-content-change-enrich is back up in eqiad. I'll follow up with an incident report. tl;dr: there was no data loss, nor impact on SLOs in the active DC. I had a good learning moment about discovery routes and active/active DCs today :) [11:56:58] gmodena Thank you so much! Would be interested in reading how it got resolved [12:03:06] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10gmodena) [12:05:11] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10gmodena) [12:13:09] 10Data-Engineering: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10TheDJ) [12:14:09] 10Data-Engineering, 10Data-Engineering-Dashiki: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10TheDJ) [12:22:59] 10Data-Engineering, 10Data-Engineering-Dashiki: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10TheDJ) @Milimetric Am I misunderstanding the graphs ? It just seems really strange to not have any Windows 11 listed there, when according to google, 20% of windows is 11, and just 7... 
[12:35:09] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [BUG] mediawi-page-content-change-enrich: cross DC network calls to swift are failing - https://phabricator.wikimedia.org/T346877 (10gmodena) [12:37:26] gehel jennifer_ebe FYI https://wikitech.wikimedia.org/wiki/Incidents/2023-09-20_mw-page-content-change-enrich [12:38:05] gmodena: thanks! the data center switch is doing its job: we're learning about new failure modes! [12:38:30] gehel 100%! [12:44:41] brouberol: despite my multiple comments, congratulations on that patch. It already makes a lot of sense and is definitely going in the right direction! [13:04:40] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) > Do you mean HTTP error status codes that EventGate is emitting or th... [13:27:31] brouberol btullis do y'all have any idea how to copy zookeeper data between znodes? Re: https://phabricator.wikimedia.org/T342149#9181140 [13:31:03] inflatador: I've never actually needed to copy or rename znodes before. Looking now. [13:32:38] Have you got zkcli available? 
[13:33:23] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) [13:34:38] OK, so `/usr/share/zookeeper/bin/zkCli.sh` works on flink-zk1001 and I can see the list of available commands if I type `help` [13:41:08] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) [13:41:10] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [13:41:49] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10Gehel) [13:41:54] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10Gehel) [13:43:28] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10Gehel) [13:43:30] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10Gehel) [13:43:33] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10Gehel) 05Open→03Resolved [13:43:35] 
10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: Flink Operations - https://phabricator.wikimedia.org/T328561 (10Gehel) [13:43:41] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) [13:44:34] (03PS7) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [13:45:25] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10xcollazo) > or the current release deployment didn't update those Variables. Actually, I am not sure of the expected behavior here... [13:46:03] 10Data-Platform-SRE: Troubleshoot rdf-streaming-updater/dse-k8s cluster - https://phabricator.wikimedia.org/T346048 (10bking) 05Open→03Resolved [13:46:05] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) [13:46:36] 10Data-Engineering, 10Data Pipelines (Sprint 14): Implement new AQS endpoints for Knowledge Gap metrics - https://phabricator.wikimedia.org/T337059 (10nickifeajika) 05Open→03Resolved This is done [13:46:38] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventgate: cache refreshes should fetch stream configs using a paginated API - https://phabricator.wikimedia.org/T346899 (10gmodena) [13:47:20] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10Gehel) [13:47:28] 10Data-Platform-SRE: 
Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10BTullis) I did some searching for ksql and turned up a similar list of related topics that we cleaned up in 2020. T252675#6134141 All of the ones that have `OTTO` in the name are likely to be r... [13:48:22] 10Data-Platform-SRE: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (10Gehel) [13:48:30] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) [13:52:45] (03CR) 10CI reject: [V: 04-1] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [14:01:39] inflatador: I have written a zookeeper cli that could help w/ that (I think) but i don't think that'll help you in the end because you'd need to be able to pip install it [14:02:01] gehel: re congrats. Thanks! [14:02:25] 10Data-Engineering, 10Data-Engineering-Dashiki: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10Reedy) >Looking a bit further, it seems most likely to me that Windows 11 is simply not being detected and grouped incorrectly into the category other in the dashboard. https://meta... [14:05:00] (03PS11) 10Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [14:05:01] inflatador: sorry about the emotional rollercoaster. 
That message ended up being of no help at all [14:05:26] brouberol btullis no worries, I think we won't have to do it [14:06:30] in case of https://github.com/KevinJMao/python-zktreeutil might be helpful [14:07:00] brouberol: We can do a pip install in our own conda environment on a stat server, so the rollercoaster is still going. The next problem would be if the zookeeper service is available from a stat server :-) [14:08:55] 10Data-Platform-SRE, 10observability, 10Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (10lmata) hi @bking Moving to radar, as I understand you're already in contact with @andrea.denisse, that seems good to go, please let me know if we can assist further. [14:10:19] ouh, that's right [14:10:21] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10lmata) Moving to radar to keep an eye out in case you need our help. Thanks! [14:11:20] (03PS8) 10Aqu: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) [14:12:20] 10Data-Engineering, 10Data-Engineering-Dashiki: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10TheDJ) But the version should still show in the major breakdown right ? 
Same as iOS family can go into major breakdown… [14:19:44] (03CR) 10CI reject: [V: 04-1] Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: 10Aqu) [14:27:14] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) The rotation/compression appears to work fine and usual day chunks are in the 2.5G ballpark, was there any unusual extra traffic which made it spike t... [14:35:18] btullis seems like we can't reach conf1007 (aka zk) from stat1004 [14:35:54] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventgate: cache refreshes should fetch stream configs using a paginated API - https://phabricator.wikimedia.org/T346899 (10Ottomata) > Why only eventgate-analytics-external is configured to refresh its internal cache, without... [14:36:14] OK, wasn't the ticket about flink-zk1001 and not conf1007? Might be the same situation, but just so we're clear. [14:38:09] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10Milimetric) I vaguely remember this thing in 2018... Windows did get grouped up, but I agree with the DJ's points and that this data makes no sense without at leas... [14:38:48] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10Milimetric) Also related, {T342267} which really needs some love as well. 
[15:15:42] (SystemdUnitFailed) firing: superset.service Failed on an-tool1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:39] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:30] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventgate: cache refreshes should fetch stream configs using a paginated API - https://phabricator.wikimedia.org/T346899 (10gmodena) >>! In T346899#9183319, @Ottomata wrote: >> Why only eventgate-analytics-external is configur... [15:18:23] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventgate: cache refreshes should fetch stream configs using in batches - https://phabricator.wikimedia.org/T346899 (10gmodena) [15:22:37] PROBLEM - puppet last run on an-tool1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:24:03] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:28:50] (03CR) 10Mforns: ":pray:" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958992 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns) [15:49:48] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Data Products: Windows 11 missing in analytics ? 
- https://phabricator.wikimedia.org/T346890 (10VirginiaPoundstone) p:05Triage→03High [15:52:01] (03CR) 10Phuedx: [C: 03+2] Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958992 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns) [15:55:38] (03Merged) 10jenkins-bot: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958992 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns) [16:09:49] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10CodeReviewBot) tchin updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/4... [16:45:26] 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire: 2023-09-20 Elasticsearch unavailable incident ticket - https://phabricator.wikimedia.org/T346945 (10bking) [16:48:44] 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [16:52:36] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Data Products: Windows 11 missing in analytics ? - https://phabricator.wikimedia.org/T346890 (10Mayakp.wiki) This feels like an effect of Chrome's UA reduction where in Phase 5, the device OS was replaced. See Rollout details [[ https://www.chromium.org/updat... 
[17:12:32] (03PS12) 10Btullis: Update to Superset version 2.0.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [17:16:47] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:42] (SystemdUnitFailed) resolved: superset.service Failed on an-tool1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:09] RECOVERY - puppet last run on an-tool1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:38:12] 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [18:26:18] 10Data-Platform-SRE, 10Discovery-Search, 10SRE-OnFire, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Aklapper) [18:46:07] (03CR) 10Dmantena: [C: 03+2] Add watchlist-specific properties to ios_watchlists [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/955405 (https://phabricator.wikimedia.org/T334968) (owner: 10Tsevener) [18:52:55] (03Merged) 10jenkins-bot: Add watchlist-specific properties to ios_watchlists [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/955405 (https://phabricator.wikimedia.org/T334968) (owner: 10Tsevener) [19:03:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:32] milimetric: Heya - Do you have a couple minutes? [19:33:01] Yes cave! [19:33:08] OMW! [20:41:09] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, and 6 others: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10colewhite) >>! In T288624#913... [21:05:06] 10Data-Engineering, 10Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (10Mayakp.wiki)