[02:35:16] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Chlod) [03:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [06:14:33] (03CR) 10Conniecc1: [C: 03+2] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [06:15:15] (03Merged) 10jenkins-bot: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [06:23:35] (03CR) 10Phuedx: [C: 03+2] Add product metrics fragments and schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [06:24:07] (03Merged) 10jenkins-bot: Add product metrics fragments and schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: 10Clare Ming) [07:41:15] (EventgateValidationErrors) firing: ... 
[07:41:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:46:15] (EventgateValidationErrors) resolved: ... [07:46:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:48:15] (EventgateValidationErrors) firing: ... [07:48:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:53:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [07:56:30] (EventgateValidationErrors) resolved: ... 
[07:56:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:58:16] (EventgateValidationErrors) firing: ... [07:58:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:31:24] * brouberol waves good morning [08:46:56] (03PS3) 10Brouberol: Rely on multiple kafka bootstrap servers in different racks [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 [08:48:17] aqu, ottomata: could I ask for a +2 for https://gerrit.wikimedia.org/r/c/analytics/refinery/+/968683 ? I don't have +2 permissions, and this would add a bit more reliability to druid in the case of a cluster rolling restart, or an operation on the "wrong" broker. Thanks! [08:59:57] 10Data-Engineering, 10Data-Platform-SRE, 10SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10brouberol) I saw that neither `kafka-logging` nor `kafka-test` have ACLs at all: ` # codfw brouberol@kafka-logging2003:~$ kafka acls... [09:03:41] (03CR) 10Aqu: [C: 03+2] "LGTM." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [09:05:36] (03CR) 10Brouberol: [V: 03+2] Rely on multiple kafka bootstrap servers in different racks [analytics/refinery] - 10https://gerrit.wikimedia.org/r/968683 (owner: 10Brouberol) [09:06:02] brouberol: I +2'd it. Would you like Sam and me to deploy it as part of our ops week? https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Druid#Realtime_indexation_to_Druid [09:06:43] yes please! I added it to https://etherpad.wikimedia.org/p/analytics-weekly-train [09:07:29] (was it the wrong place to add it to?) [09:27:57] Morning all [09:56:25] (03CR) 10Aqu: [C: 03+1] "Looks good." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [10:27:12] !log running scap deploy for airflow-dags/analytics [10:27:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:33:08] !log restarting hadoop-hdfs-datanode.service and hadoop-yarn-nodemanager.service on an-worker1111 to pick up puppet7 changes. [10:33:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:43:32] We're about to roll out puppet 7 to all of the hadoop workers in production. We've tested on the hadoop-test cluster and we have tested a single worker in production, so we're as confident as we can be that it will be OK. I'll run a rolling restart of the worker processes once this is done. [10:46:31] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10BTullis) The steps look good to me Steve. There's a little duplication because you have said both of: > Drain the middlemanagers and > Set nodes into decommissioningNodes mode You can see from [[https://wikitech.wiki... 
[10:49:44] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10BTullis) I would also look at preparing the patch to [[https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service|remove the three hosts from LVS]] ahead of time, so that we have plenty of time to r... [11:53:57] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [12:00:06] 10Data-Platform-SRE, 10DBA, 10Data Engineering and Event Platform Team, 10Data-Services: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (10BTullis) 05Open→03Resolved a:03BTullis This is done now. Once again it took two executions of the cookbook, but the opens... [12:01:31] (EventgateValidationErrors) firing: ... [12:01:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:07:19] !log roll restart druid workers to pick up new zookeeper host druid1009 T336042 [12:36:18] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10BTullis) > The days of this deployment are numbered, so anything we do here qualifies as a temporary fix. I agree, this is fine. 
We are sunsetting AQS 1.0 in the coming... [13:22:07] * brouberol is back [13:22:31] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) I switched both an-mariadb100[12] servers to use GTID based replication, rather than a simple binlog position. I have also switched db1208 to replicate from an-mariadb1001 inst... [13:23:41] (DruidSegmentsUnavailable) resolved: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [13:26:06] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) [13:32:40] btullis: anything I can help w/ at the moment? [13:37:16] brouberol: Thanks. I think I'm ok at the moment. Several things going on, but sort of on top of them, ish. If I were you I would probably start looking at another bullseye upgrade ticket, like perhaps T332604 ? [13:37:17] T332604: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 [13:40:17] I *think* that these servers are OK to reimage and reformat `/srv` because they will automatically re-load the data from HDFS to `/srv/druid` [13:42:14] for sure [13:42:51] I would start with a host *other than* an-druid1001, because this host is used as an ingestion target: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py#L169-174 [13:44:11] On the 'analytics' druid cluster (that's all hosts named an-druid*) we don't have LVS load-balancing, so there is not a service address we can move around. 
[13:44:16] yep, I was opening codesearch to look for offenderes [13:44:21] *offenders [13:45:27] It's very confusing naming, because we have the 'public' druid cluster (druid10*) and the 'analytics' druid cluster (an-druid10*). [13:46:04] ack [13:46:22] I see a lot of `druid_overlord_url = http://an-druid1001.eqiad.wmnet:8090` in refinery [13:46:28] The public cluster has LVS, but for some fiddly reason to do with vlans, we don't have it on the analytics cluster. [13:46:31] https://www.irccloud.com/pastebin/ztxSRZkE/ [13:46:53] maybe we could reimage 2, 3, 4 and 5, change these overlord urls and then move on to 1? [13:47:48] Yes, that's about right. Don't forget that airflow-dags isn't in codesearch yet either :-) We should add it. [13:48:19] haha indeed, I have another tab open about this [13:48:24] maybe I should start there [13:48:39] So the cluster will be at 80% capacity whilst you reimage each node. [13:49:11] It's also worth familiarising yourself with the various different web interfaces of druid. e.g. https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Druid#Coordinators_Administration_UI [13:50:16] I think that the coordinator is the most useful, but only one of the five servers is the 'leader' [13:50:51] There is also a co-located zookeeper cluster with nodes 1-3. That will add to the fun. [13:51:34] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (10JMeybohm) a:03JMeybohm [13:54:03] Hmm. Zookeeper *should* just be ok if its `/var/lib/zookeeper` goes away and gets reinitialized, but then again keeping a backup might be a good idea. Here's a reference ticket, which was for zookeeper-specific servers: https://phabricator.wikimedia.org/T329362#8624597 [13:55:00] !log beginning rolling restart of all hadoop workers in production, to pick up new puppet 7 CA settings. 
[13:55:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:55:58] brouberol: Maybe it would be simpler to start with 1005 and work backwards to 1001 ? [14:08:35] yep, sorry, that's what I nea [14:08:38] *meant [14:08:55] my previous message didn't imply any order except that I'd finish by 1001 [14:09:01] but I wasn't clear at all [14:10:34] (03CR) 10Ottomata: [C: 03+2] Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [14:12:24] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: workflow_utils conda gitlab CI templates broken - https://phabricator.wikimedia.org/T350732 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_reque... [14:12:28] btullis: FYI https://gerrit.wikimedia.org/r/c/labs/codesearch/+/972829 <-- airflow-dags indexing in codesearch [14:13:13] FYI only as gerrit has already assigned reviewers [14:19:47] oh, it seems there's an automerge workflow setup. Now to figure out if we have CD on it [14:20:30] (03Merged) 10jenkins-bot: Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [14:21:55] actually scratch that. The reviewer was so fast that I mistook them for a bot 🤦 [14:22:05] the change should be visible ~tomorrow [14:32:55] Great, yes that's Amir.1 reviewing and deploying it. Good stuff. [14:33:00] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10brouberol) Is this ticket still needed? 
I see that admin groups such as `analytics-platform-eng-admins` can run `sud... [14:34:26] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10brouberol) Same goes for the following groups: - `analytics-research-admins` - `airflow-search-admins` - `airflow-an... [14:39:44] 10Data-Platform-SRE: Clean up deployment-charts leftovers after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [14:41:51] 10Data-Platform-SRE: Clean up deployment-charts leftovers after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [14:41:55] 10Data-Platform-SRE: Clean up deployment-charts leftovers after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [14:46:55] 10Data-Engineering, 10serviceops, 10Event-Platform: Increase k8s namespace limits for eventgate-analytics - https://phabricator.wikimedia.org/T350707 (10JMeybohm) 05Open→03Resolved [14:52:21] btullis: I'll start with an-druid1005. I've checked, there's no zookeeper running on it. If we _can_ avoid formatting /srv, shouldn't we try, though? [14:58:32] Yes, OK. It looks like we reserve the uid:gid for druid https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L1133-L1137 [14:59:06] So this means that when it gets reinstalled the files on `/srv` should be owned by the right posix user/group. [14:59:24] Feel free to select a reuse-parts recipe for these druid servers then. 
[14:59:54] It's only ~1.8TB of data, but it's an amount we won't have to pull from HDFS for naught [15:43:52] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) I also see these [[ https://grafana-rw.wikime... [15:52:16] !log Add analytics-wmde service user to the Yarn production queue T340648 [15:52:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:52:19] T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 [15:56:49] I *think* this will keep the lvm volume that we mount to /srv on an-druid hosts https://gerrit.wikimedia.org/r/c/operations/puppet/+/972851/ but if someone is accustomed to partman recipes, I'd appreciate an extra set of eyes. Thanks! [15:58:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1127:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:59:05] I meant "will keep the lvm volume [...] during a reimage" [16:01:31] (EventgateValidationErrors) firing: ... [16:01:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:06:31] (EventgateValidationErrors) resolved: ... 
[16:06:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:08:24] 10Data-Platform-SRE: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [16:08:39] 10Data-Platform-SRE, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [16:09:16] (EventgateValidationErrors) firing: ... [16:09:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:11:31] (EventgateValidationErrors) resolved: ... [16:11:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:14:15] (EventgateValidationErrors) firing: ... 
[16:14:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:16:01] 10Data-Engineering, 10CommonsMetadata, 10DiscussionTools, 10MediaWiki-extensions-Scribunto, and 5 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (10Krinkle) [16:18:59] (PuppetFailure) firing: (3) Puppet has failed on an-worker1099:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:20:21] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) Some googling indicates that this could possi... [16:24:31] (EventgateValidationErrors) resolved: ... 
[16:24:37] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:28:59] (PuppetFailure) firing: (5) Puppet has failed on an-worker1086:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:31:31] (EventgateValidationErrors) firing: ... [16:31:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:31:45] (EventgateValidationErrors) resolved: ... [16:31:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:33:59] (PuppetFailure) firing: (5) Puppet has failed on an-worker1086:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:37:15] (EventgateValidationErrors) firing: ... 
[16:37:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:38:59] (PuppetFailure) resolved: (5) Puppet has failed on an-worker1086:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:46:31] (EventgateValidationErrors) resolved: ... [16:46:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:47:15] (EventgateValidationErrors) firing: ... 
[16:47:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:52:50] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (10bking) [17:48:32] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Chlod) Think this might be worth a #user-notice, given that it affects volun... [18:02:21] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar, 10Epic, 10Kubernetes: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking) @JMeybohm posted https://helm-playground.com in #wikimedia-k8s-sig today, this could be a piece of the puzzle as well. [18:04:28] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10bking) Per conversation with @Gehel , we might need to do a scream test by shutting off the old instances. Will look into this next week. 
[18:25:40] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) [19:01:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:39:50] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [19:47:07] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (10EBernhardson) The relevant graphs seem reasonable, it looks like work has transferred over to the new instances. The msearch daemon only has work to perform on... [19:48:04] 10Data-Platform-SRE, 10Discovery-Search: Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10bking) [19:50:50] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) 05Open→03Resolved a:03Eevans This has been pushed out; The Bullseye upgrade is unblocked [19:50:54] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:52:44] 10Data-Platform-SRE, 10Discovery-Search: Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10bking) My rough notes around this subject are [[ https://etherpad.wikimedia.org/p/backfill | here ]] . I'm still learning Flink and Kafka, so will need some help creating the ba... 
[19:55:17] 10Data-Platform-SRE, 10Discovery-Search: Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10bking) [19:55:19] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10bking) [20:28:07] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) 05Open→03Invalid Upon further review, I'm declining this as invalid. We do need to track resource usage, but that... [20:28:11] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10bking) [20:28:26] 10Data-Platform-SRE, 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [20:29:30] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (10Ottomata) a:03Ottomata [20:41:31] (EventgateValidationErrors) resolved: ... 
[20:41:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:47:56] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) [20:49:43] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) @Chlod thanks. TIL about #User-Notice. Updated description for U... [20:53:07] Starting build #130 for job analytics-refinery-maven-release-docker [21:01:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on aqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:07:16] (EventgateValidationErrors) firing: ... 
[21:07:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:08:13] Project analytics-refinery-maven-release-docker build #130: 09SUCCESS in 15 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/130/ [21:09:11] Starting build #89 for job analytics-refinery-update-jars-docker [21:09:34] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.25 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/972515 [21:09:34] Project analytics-refinery-update-jars-docker build #89: 09SUCCESS in 22 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/89/ [21:11:44] !log deploying refinery with refinery-source  0.2.25 jars for T321854 [21:12:15] (EventgateValidationErrors) resolved: ... 
[21:12:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:12:18] (03CR) 10Ottomata: [C: 03+2] Add refinery-source jars for v0.2.25 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/972515 (owner: 10Maven-release-user) [21:12:20] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add refinery-source jars for v0.2.25 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/972515 (owner: 10Maven-release-user) [21:33:34] 10Data-Platform-SRE: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 (10bking) [22:19:38] (03PS1) 10Bearloga: Document manual execution steps [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/972905 [22:19:57] (03CR) 10Bearloga: [V: 03+2 C: 03+2] Document manual execution steps [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/972905 (owner: 10Bearloga) [22:44:23] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Chlod) Oh, looks like I haven't subscribed to any of those lists. 😅 Re: wiki...