[02:18:57] Data-Engineering, Data-Platform-SRE, Privacy Engineering, Patch-For-Review, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (sguebo_WMF) Hi, all — I’ll share here a joint privacy review of the two proposed changes: enabling the TagManager...
[03:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.303% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:01:49] Analytics, Data-Engineering, Data-Engineering-Wikistats: Make wikistats pages, sections and individual infoboxes transcludable - https://phabricator.wikimedia.org/T351053 (Klein)
[07:31:48] (CR) Awight: Remove deprecated tech wish scripts (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: WMDE-Fisch)
[07:42:45] * brouberol waves good morning!
[07:43:07] * joal answers back
[07:43:12] o/
[07:48:06] (CR) Thiemo Kreuz (WMDE): Remove deprecated tech wish scripts (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: WMDE-Fisch)
[07:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.303% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:59:15] (EventgateValidationErrors) firing: ...
[07:59:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:11:02] Data-Engineering, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (JAllemandou) I think this ticket can be closed. We now have a job giving us statistics details on HDFS folders.
[08:15:34] Data-Engineering, Data Pipelines: Prune raw HDFS FSImages stored on HDFS - https://phabricator.wikimedia.org/T325103 (JAllemandou)
[08:16:02] Data-Engineering, Data Pipelines: Prune raw HDFS FSImages stored on HDFS - https://phabricator.wikimedia.org/T325103 (JAllemandou) I think we should implement this quickly. It's relatively cheap and this consumes cluster storage space for no good reason.
[08:16:47] (CR) WMDE-Fisch: Remove deprecated tech wish scripts (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: WMDE-Fisch)
[08:17:26] Data-Platform-SRE: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (brouberol) Every Tuesday it is
[08:20:40] Data-Engineering: Reduce the number of files generated by geoeditors airflor jobs - https://phabricator.wikimedia.org/T304852 (JAllemandou) I think this has been solved - I've checked the data and there is no more folder with big number of files. @mforns can you confirm if you remember finishing this? Many t...
[08:26:25] Data-Platform-SRE, SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (brouberol)
[08:30:23] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (brouberol)
[08:30:26] Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (brouberol) Open→In progress a:brouberol
[08:39:31] (EventgateValidationErrors) resolved: ...
[08:39:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:46:16] (EventgateValidationErrors) firing: ...
[08:46:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:51:16] (EventgateValidationErrors) resolved: ...
[08:51:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:54:15] (EventgateValidationErrors) firing: ...
[08:54:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:58:20] is there a visual way to browse existing hiera data? A web interface, or something else?
[09:00:29] Data-Platform-SRE, SRE, Patch-For-Review: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (Peachey88)
[09:01:52] brouberol: your IDE or https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/ :-P
[09:02:01] but probably not what you were looking for :D
[09:04:00] (also mirrored to github ofc)
[09:06:37] gotcha, I thought that, somehow, we could browse them in a puppetboard-like interface. No worries
[09:07:36] so in https://gerrit.wikimedia.org/r/c/operations/puppet/+/973308/comment/972bd071_80aa7433/, when you're referring to hiera data coming from netbox, are you referring to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/network/data/data.yaml ?
[09:08:16] [to previous comment] unfortunately not, the actual set of hiera effectively loaded for compiling a given catalog is a bit harder to get
[09:11:44] no, I'm referring to the data that we export semi-automatically from netbox (automatic generation, human confirmation) via the sre.puppet.sync-netbox-hiera cookbook, which is then exported into a git repository that puppet also accesses (look for netbox in modules/puppetmaster/files/hiera/production.yaml)
[09:12:18] looking, thanks
[09:12:54] (some related docs at https://wikitech.wikimedia.org/wiki/Monitoring/sre.puppet.sync-netbox-hiera.timer )
[09:13:20] jbond: I couldn't find better docs for the netbox-hiera integration, did I look in the wrong places or are we missing it?
[09:33:53] volans: I think that's the best we have currently
[09:34:01] (CR) Thiemo Kreuz (WMDE): [C: +1] Remove deprecated tech wish scripts (2 comments) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: WMDE-Fisch)
[10:06:39] (CR) Urbanecm: [C: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[10:26:33] (PS1) Phuedx: Add sampling configuration to /analytics/mediawiki/client/metrics_event [schemas/event/secondary] - https://gerrit.wikimedia.org/r/973729 (https://phabricator.wikimedia.org/T350495)
[10:27:05] (CR) CI reject: [V: -1] Add sampling configuration to /analytics/mediawiki/client/metrics_event [schemas/event/secondary] - https://gerrit.wikimedia.org/r/973729 (https://phabricator.wikimedia.org/T350495) (owner: Phuedx)
[10:30:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:30:30] (PS2) Phuedx: Add sampling configuration to /analytics/mediawiki/client/metrics_event [schemas/event/secondary] - https://gerrit.wikimedia.org/r/973729 (https://phabricator.wikimedia.org/T350495)
[10:30:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:31:59] (PuppetFailure) firing: Puppet has failed on an-worker1101:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:35:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:41:59] (PuppetFailure) resolved: Puppet has failed on an-worker1101:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:44:38] Data-Platform-SRE, Data-Services, cloud-services-team: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (BTullis) @taavi - Thanks so much, that does look really helpful. The only other thing I think would be helpful is if we could somehow also remove the spof on...
[10:49:23] !log systemctl reload haproxy on dbproxy1019 to depool the web wikireplica cluster
[10:49:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:55:11] Data-Platform-SRE, Data-Services, cloud-services-team: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) My change doesn't immediately fix the proxy redundancy issue, but it definitely makes it much easier to solve, as all of the backend configuration will a...
[11:01:59] (PS1) Joal: Fix unique_devices iceberg insertion jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/973742
[11:04:56] (PS2) Joal: Fix unique_devices iceberg insertion jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/973742 (https://phabricator.wikimedia.org/T350920)
[11:06:23] Data-Engineering (Sprint 5), Data-Platform, Movement-Insights, Patch-For-Review: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 - https://phabricator.wikimedia.org/T350920 (JAllemandou) Thank you to @Hghani for this finding. All unique-devices iceberg insertion jobs wher...
[11:08:03] Data-Engineering (Sprint 5): [Data Platform] Document proposal for data-product configuration store - https://phabricator.wikimedia.org/T349746 (JAllemandou) The document [[ https://docs.google.com/document/d/1tuoRviz3kNgUNOnSjtP5Pr6ikAiZWOdWxUytrDd1ZKs/edit?pli=1#heading=h.3k1uzt7e33l4 | is here ]].
[11:08:20] !log rebooting clouddb1013 to pick up new kernel and SSL CA settings
[11:08:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:11:09] btullis: I'm not sure if you are aware but you have +2'ed this but not merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/968666/
[11:12:06] Data-Platform-SRE, Data-Services, cloud-services-team: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) Or we could use the opportunity to do both changes at the same time, and also combine it with moving the load balancing to our new `cloudlb` setup and r...
[11:13:26] jbond: Thanks, you're right. I did sort of do it on purpose, but then I got a bit sidetracked with the wikireplicas restart. I'm doing the wikireplicas restart now, so I'll try to merge the one you mentioned after that.
[11:14:25] ack thanks
[11:19:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on an-worker1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:20:58] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:36] FYI, I'm switching kafka-jumbo1007 to Puppet 7 in a bit, let me know if you spot any issues. The kafka-test cluster was already migrated last week, so I wouldn't expect any issues
[11:33:22] (the new Puppetserver 7 infrastructure runs in parallel to the legacy puppet 5 masters)
[11:56:46] Ack - thanks moritzm
[11:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.303% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:17:08] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on an-worker1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:30:52] Data-Platform-SRE, Data-Services, cloud-services-team, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (BTullis) >>! In T300427#9326131, @taavi wrote: > Or we could use the opportunity to do both changes at the same time, and also combine it...
[12:31:06] headsup, I'm going to restart kafka-jumbo1007
[12:31:21] brouberol: Ack, thanks.
[12:31:57] !log repooled clouddb10[13-16] post maintenance.
[12:32:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:32:12] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-worker1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:36] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:14] Data-Platform-SRE, Data-Services, cloud-services-team, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) >>! In T300427#9326322, @BTullis wrote: > That would mean: > * integrating the work on this ticket, correct? {T346947} > * whilst...
[12:45:38] Data-Platform-SRE, Patch-For-Review: Regenerate the skein certificates during the first buisiness day of the month - https://phabricator.wikimedia.org/T350945 (brouberol) Open→Resolved
[12:54:16] (EventgateValidationErrors) firing: ...
[12:54:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[13:18:51] FYI, migrating the rest of kafka-jumbo to Puppet 7 now
[13:20:05] ack, thanks
[13:25:44] kafka/jumbo is done
[13:30:27] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:12] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on an-worker1135:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:19] !log reloaded haproxy on dbproxy1018 to depool the analytics wikireplicas cluster
[13:57:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:08:12] Data-Platform-SRE, Data-Services, cloud-services-team, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi)
[14:10:24] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (VRiley-WMF) Open→Resolved
[14:14:28] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer)
[14:16:23] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer)
[14:17:53] Data-Platform-SRE: Regenerate the skein certificates during the first business day of the month - https://phabricator.wikimedia.org/T350945 (Aklapper)
[14:32:21] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (Ottomata) > Allow headers to be set via Event-Gate And also by our Kafka Serialization code, similar to the work @gmodena is doing in {T338231} The config and implementation in eventgate-wikimedia and in wikimedia-event-ut...
[14:38:15] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (Ottomata) > Kafka supports headers that can carry string-byte[]-tuples. Ah, right, so I suppose the byte[] value will be the json data. But, perhaps for headers, it would be simpler to support only string: string headers, i...
[14:59:43] Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (Ahoelzl)
[15:06:33] Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (Antoine_Quhen) We have multiple needs considering scheduling Airflow dags & tasks. Without a definitive solution, an alternati...
[15:36:36] (CR) Aqu: [C: +1] "Looks good." [analytics/refinery] - https://gerrit.wikimedia.org/r/973742 (https://phabricator.wikimedia.org/T350920) (owner: Joal)
[15:38:13] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (bking) Another progress report: We are 80% (869/1104) done on the leading host (wdqs1022).
[15:50:08] Data-Engineering (Sprint 5): [Data Platform] Document proposal for data-product configuration store - https://phabricator.wikimedia.org/T349746 (Ahoelzl) Provide list of configuration use cases and implications.
[15:52:15] Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (bking) >>! In T350703#9318370, @MoritzMuehlenhoff wrote: > I've also extended this task to cover the restarts for WCSQ and WQDS, I've just rolled out the respective Jav...
[15:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:09:10] (CR) Xcollazo: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/973742 (https://phabricator.wikimedia.org/T350920) (owner: Joal)
[16:10:23] Data-Platform-SRE, Discovery-Search (Current work): CirrusSearch: make p95 alerts more granular - https://phabricator.wikimedia.org/T349340 (Gehel)
[16:11:56] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > I think we should be good to go. @xcollazo - if you have some time to test please, that would be great. Hey @BTullis, hav...
[16:16:27] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel)
[16:20:37] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Antoine_Quhen) a:Antoine_Quhen
[16:22:33] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (Gehel)
[16:22:36] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (Gehel)
[16:25:42] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (Gehel)
[16:34:18] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel)
[16:43:20] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel)
[16:54:16] (EventgateValidationErrors) firing: ...
[16:54:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[16:55:03] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer) > That way every header is a simple string key: string value? @Ottomata, sure, that would save us one source of error. > It'd be nice to be able to use custom logic to set the header, so we could set something li...
[16:59:41] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (Ottomata) > the event utilities implement that convention. I think I'd prefer that we didn't add custom logic to set headers if we can avoid it. Perhaps we can justify it for canary events, but I have a feeling it will add...
[17:01:01] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (Ottomata) And, actually, having all the 'domain' in the header in general might be useful if someone wants/needs to do some custom filtering on other domains (e.g. wikidata?) before deserializing too? Of course the lib woul...
[17:04:17] (EventgateValidationErrors) resolved: ...
[17:04:17] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[17:07:55] !log deploying refinery with refinery source 0.2.25 jars and using 0.2.25 for refine job - T321854
[17:07:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:07:58] T321854: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854
[17:16:46] (EventgateValidationErrors) firing: ...
[17:16:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[17:21:46] (EventgateValidationErrors) resolved: ...
[17:21:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[17:22:00] (EventgateValidationErrors) firing: ...
[17:22:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[17:44:27] Data-Engineering, CommonsMetadata, DiscussionTools, MediaWiki-extensions-Scribunto, and 8 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Krinkle) p:Triage→High
[17:53:26] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/mjolnir-deploy/-/merge_requests/1 Update mjolnir to...
[17:53:36] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/mjolnir-deploy/-/merge_requests/1 Update mjolnir to...
[17:55:35] Data-Engineering, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Add mw.eventLog.pageviewInSample() - https://phabricator.wikimedia.org/T348777 (phuedx)
[17:56:03] Data-Engineering, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Add mw.eventLog.pageviewInSample() - https://phabricator.wikimedia.org/T348777 (phuedx) Open→Resolved a:phuedx
[17:56:08] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (phuedx)
[17:56:58] Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14): Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (phuedx) This is Done™. I'm leaving this task open to track monitoring the client-side error logs during this week's train d...
[17:57:30] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (EBernhardson) The repo and the airflow side were updated, but we missed updating the search-loader daemons with the new version. Patches ab...
[18:58:02] Data-Engineering: [Data Quality] Log selected Spark metrics and visualize on dashboard - https://phabricator.wikimedia.org/T349764 (Ahoelzl)
[18:59:33] Data-Engineering, Observability-Metrics: [Data Quality] Sending Apache Spark metrics to PushGateway - https://phabricator.wikimedia.org/T297231 (Ahoelzl)
[19:00:25] Data-Engineering (Sprint 5): [Data Quality] [Needs Grooming] Calculate and log post processing metrics for webrequests - https://phabricator.wikimedia.org/T349456 (Ahoelzl)
[19:25:58] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (RKemper) Talked to @EBernhardson last week and one thing we were uncertain of is if it made sense to set SLOs on metrics such as `MediaSearch latency p95` which, with the met...
[19:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:05:54] Data-Engineering, Data-Engineering-Wikistats, Trust-and-Safety, Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (stjn) @lbowmaker: given your move of both this task (that requires less work) and T333716 to ‘icebox’, can you c...
[20:41:00] Data-Engineering, Data-Engineering-Kanban, Data Pipelines: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (mpopov) There's a good chance this is responsible for {T347076}
[20:42:22] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (Ottomata) Reverting deployment for production refine jobs. There was an edge...
[20:45:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:48:14] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (bking) Both apps (commons and wikidata) are stable in staging-eqiad now: ` bking@deplo...
[21:11:12] Data-Platform-SRE: Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (bking)
[21:22:01] (EventgateValidationErrors) firing: ...
[21:22:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:27:02] !log reloading haproxy on dbproxy1018 post maintenance
[21:27:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:28:46] !log deploying updated datahub containers for T348647
[21:28:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:28:54] Data-Platform-SRE: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (bking)
[21:29:04] Data-Platform-SRE: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (bking)
[21:29:06] Data-Platform-SRE, Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking)
[21:39:16] (EventgateValidationErrors) resolved: ...
[21:39:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:45:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:24:37] Data-Platform-SRE, Discovery-Search (Current work): Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039 (bking) Per the last deploy message above, it looks like mjolnir is running successfully under Bullseye and Python 3.10. The next step is to decom the older, bu...
[22:46:36] Is there a convenient way to transfer DDL for all hive schemas on the stat boxes, so that my local IDE can provide useful completion? Or is there another suggestion such as ssh tunneling pyspark, or developing in a remote jupyter notebook...
[22:52:57] Data-Platform-SRE, Patch-For-Review: Decommission search-loader1001/2001 VMs - https://phabricator.wikimedia.org/T351123 (bking) Command should be ` sudo cookbook sre.hosts.decommission search-loader1001.eqiad.wmnet,search-loader2001.codfw.wmnet -t T351123`. I'm at the end of my day, so will run tomorrow.
[22:54:32] awight: You've got me there. I haven't tried any of these things yet. I wouldn't be at all surprised if a remote notebook worked from your local IDE, using SSH tunnelling, but would that give you auto-completion? I'm not even sure.
[23:00:05] I'm writing a few short scripts and happy to use whatever is normal, just not sure what that would be? The wikitech pages seem to recommend the hive commandline...
[23:34:23] !log rebooting clouddb1021 to pick up new kernel and puppet 7 CA.
[23:34:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:58:54] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.302% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace