[00:03:31] (SystemdUnitFailed) firing: (17) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:33:29] (SystemdUnitFailed) firing: (17) refinery-sqoop-whole-mediawiki.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:45:02] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: prefix symbol that modifies unit magnitude - https://phabricator.wikimedia.org/T356534 (10nshahquinn-wmf) I vaguely recall that in the initial version of Wikistats 2, we did use the SI prefixes (k, M, G, T) consistently for all... [01:43:29] (SystemdUnitFailed) firing: (18) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:43:44] (SystemdUnitFailed) firing: (18) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:31] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:18:17] 10Data-Engineering, 10Data Products, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10gmodena) @Fabfur here is example payload with added `meta`, as 
we'd expect to receive according to the [WIP webrequest eve... [08:45:21] 10Data-Platform-SRE, 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Jelto) 05In progress→03Resolved a:03Jelto Great! I'll resolve this task, all access should be available again. Feel free to reopen the ticket if... [09:21:25] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10Gehel) 05Open→03Resolved [09:21:27] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10Gehel) [09:21:30] 10Data-Engineering (Sprint 8), 10Image-Suggestions, 10Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 - https://phabricator.wikimedia.org/T356030 (10Gehel) 05Open→03Resolved a:03Gehel [09:43:44] (SystemdUnitFailed) firing: (18) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:31] 10Data-Engineering, 10Data-Platform-SRE: wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926 (10BTullis) I just discovered this old ticket whilst searching for something else. Apologies for having missed... 
[10:05:41] 10Data-Engineering, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Apply our patches not yet merged upstream to the supserset codebase in our Docker image - https://phabricator.wikimedia.org/T356477 (10CodeReviewBot) brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/sup... [10:09:33] btullis: is there anything left to do on T344202? Or can I close it? [10:09:34] T344202: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 [10:23:12] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Observability-Alerting, 10observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (10BTullis) Thanks @fgiunchedi. The new routing key is in place in the Alertmanager configuration. I have now merged th... [10:23:43] gehel: thanks for the prompt. I just checked and there is one small change left to do, just to get Icinga to match the new alertmanager config. Not a biggie. 
[10:24:36] thanks for moving it back to in progress, we won't lose that last bit [10:40:18] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10Gehel) [10:40:32] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10Gehel) a:03RKemper [10:40:40] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Gehel) a:03RKemper [10:40:46] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Gehel) [10:40:56] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) a:03RKemper [10:41:10] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) [10:45:18] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10Gehel) [10:45:45] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (10Gehel) [10:45:49] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10observability, 10Patch-For-Review: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (10Gehel) [10:45:51] 
10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Create a helm chart for Superset - https://phabricator.wikimedia.org/T352166 (10Gehel) [10:45:58] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (10Gehel) [10:46:11] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: [superset k8s] Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (10Gehel) [10:46:13] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Create helmfile deployment files for superset and superset-next - https://phabricator.wikimedia.org/T353790 (10Gehel) [10:46:19] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 (10Gehel) [10:46:25] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (10Gehel) [10:46:29] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Gehel) [10:46:43] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481 (10Gehel) [10:46:52] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Observability-Alerting, 10observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (10Gehel) [10:46:55] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL 
endpoint - https://phabricator.wikimedia.org/T351488 (10Gehel) [10:47:00] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (10Gehel) [10:47:15] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (10Gehel) [10:47:19] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T356313 (10Gehel) [10:47:22] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (10Gehel) [10:47:24] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (10Gehel) [10:47:27] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (10Gehel) [10:47:33] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (10Gehel) [10:47:46] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (10Gehel) [10:47:48] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) [10:47:53] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check log rotation settings on airflow instances - https://phabricator.wikimedia.org/T339015 (10Gehel) [10:47:57] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) [10:48:01] 
10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Draft a kafka upgrade plan for all the WMF clusters - https://phabricator.wikimedia.org/T355550 (10Gehel) [10:48:05] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10Gehel) [10:48:09] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10CirrusSearch, 10Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10Gehel) [10:48:17] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Observability-Alerting: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (10Gehel) [10:48:23] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10DC-Ops, 10SRE, 10ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10Gehel) [10:48:42] 10Data-Platform-SRE, 10Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (10Gehel) a:05bking→03None [10:53:09] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of mhoutti - https://phabricator.wikimedia.org/T356641 (10Gehel) [10:53:16] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of shubhankar - https://phabricator.wikimedia.org/T355501 (10Gehel) [10:53:28] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (10Gehel) [10:53:32] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (10Gehel) [10:53:37] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of jbond - https://phabricator.wikimedia.org/T352511 (10Gehel) [10:53:45] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of 
nickifeajika - https://phabricator.wikimedia.org/T354241 (10Gehel) [10:53:52] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Check home/HDFS leftovers of daniram - https://phabricator.wikimedia.org/T355108 (10Gehel) [10:58:30] (SystemdUnitFailed) firing: (18) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:56] 10Data-Engineering, 10Data Products, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Fabfur) With Benthos we have the ability to set actual metadata attached to the message (eg. with `meta` keyword) or simpl... [11:02:44] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10DC-Ops, 10SRE, 10ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10BTullis) In case it helps, we saw the same hardware error recently on a server in codfw. T355830#9517443 @Jhancock.wm was able to fix it... [11:29:22] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10kostajh) [11:29:43] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10kostajh) [11:46:52] 10Data-Engineering, 10Data Products, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10Fabfur) >>! In T351117#9527936, @gmodena wrote: > ` > { > "meta": { > dt: "2023-11-23T16:04:17Z", # value set by... 
[12:18:30] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:30] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:30] (SystemdUnitFailed) resolved: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:05] 10Data-Engineering, 10Data Products, 10Observability-Logging, 10Traffic, 10Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (10gmodena) > Both approaches are feasible (also at the same time if we do accept to increase the payload a little)... Nice.... 
[12:42:38] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (10Gehel) Handover of apifeature usage isn't happening at the moment, #data-platform-sre will take care of this upgrade [12:47:16] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: Page views by country and total page namespaces are confusingly displayed - https://phabricator.wikimedia.org/T354932 (10lbowmaker) [12:48:27] 10Data-Engineering, 10Data Pipelines: Implement Image Recommendations Algorithm Performance Metrics - https://phabricator.wikimedia.org/T294478 (10lbowmaker) 05Open→03Declined No longer needed [12:49:29] 10Data-Engineering, 10Data Pipelines, 10Product-Analytics: Add Product-Analytics Announcements to Airflow job for notifications - https://phabricator.wikimedia.org/T301281 (10lbowmaker) 05Open→03Resolved a:03lbowmaker This is done. PA receive notifications now. [12:52:08] 10Data-Engineering, 10Data Pipelines: SPIKE: Adapt our pipelines to codfw switch - https://phabricator.wikimedia.org/T328365 (10lbowmaker) 05Open→03Resolved Automated now [12:53:53] 10Data-Engineering, 10Data Pipelines: Data Warehouse Evaluation Spike. - https://phabricator.wikimedia.org/T323994 (10lbowmaker) 05Open→03Resolved Resolving as done. Doc is stored here: https://docs.google.com/document/d/1LVGG9yO_ogFgUY3L_7fAUqdcA436H8WVhT_lDkL9XEU/edit#heading=h.abb72eh1oq8g [12:56:30] 10Data-Engineering, 10Data Pipelines: Define and Create Logging Routines - Airflow UI - https://phabricator.wikimedia.org/T292747 (10lbowmaker) 05Open→03Declined No longer needed [12:57:08] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 09), 10Technical-Debt: Fix public documentation for mw.eventLog.submit() and dispatch() - https://phabricator.wikimedia.org/T357003 (10WDoranWMF) @apaskulin We are interested now! Th... 
[12:57:19] 10Data-Engineering, 10Data Pipelines: [Migration] migrate simple oozie jobs - https://phabricator.wikimedia.org/T324486 (10lbowmaker) 05Open→03Resolved Migration to Airflow is complete [13:24:13] 10Data-Platform-SRE: Review and decom all Search Platform servers past the 5-year rotation date - https://phabricator.wikimedia.org/T356887 (10Gehel) p:05Triage→03High [13:25:01] 10Data-Platform-SRE, 10Discovery-Search (Current work): Document review/refresh for https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10Gehel) p:05Triage→03High [13:25:17] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Discovery-Search (Current work): Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303 (10Gehel) [13:25:24] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Discovery-Search (Current work): Document review/refresh for https://wikitech.wikimedia.org/wiki/Search - https://phabricator.wikimedia.org/T356806 (10Gehel) [13:26:31] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Set requests (not limits) for cirrus-streaming-updater in k8s - https://phabricator.wikimedia.org/T348350 (10Gehel) [13:27:52] 10Data-Platform-SRE, 10CirrusSearch, 10Discovery-Search, 10serviceops: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (10Gehel) a:05brouberol→03None [13:28:11] 10Data-Platform-SRE, 10Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (10Gehel) p:05Triage→03Medium [13:28:29] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Movement-Insights: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy - https://phabricator.wikimedia.org/T356230 (10Gehel) [13:28:43] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Movement-Insights: Create a DataHub group for the Movement Insights team - 
https://phabricator.wikimedia.org/T354211 (10Gehel) [13:29:11] 10Data-Platform-SRE, 10Movement-Insights: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231 (10Gehel) p:05Triage→03Medium [13:29:23] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Movement-Insights: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231 (10Gehel) [13:29:40] 10Data-Platform-SRE: Remove nickifeajika from analytics-privatedata-users - https://phabricator.wikimedia.org/T353665 (10Gehel) p:05Triage→03Low [13:30:02] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Remove nickifeajika from analytics-privatedata-users - https://phabricator.wikimedia.org/T353665 (10Gehel) [13:30:31] 10Data-Engineering, 10Data-Platform-SRE: wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926 (10Gehel) p:05Triage→03Medium [13:30:56] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' 
- https://phabricator.wikimedia.org/T320926 (10Gehel) [13:31:08] 10Data-Platform-SRE, 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10Gehel) p:05Triage→03High [13:31:18] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Discovery-Search (Current work): Develop recovery/reindex procedures for new Search Update Pipeline - https://phabricator.wikimedia.org/T356803 (10Gehel) [13:31:28] 10Data-Engineering, 10Data-Engineering-Jupyter, 10Data-Platform-SRE: Use custom CDN if possible for Jupyter HTML exported notebooks - https://phabricator.wikimedia.org/T357064 (10Gehel) p:05Triage→03Medium [13:31:49] 10Data-Engineering, 10Data-Engineering-Jupyter, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Use custom CDN if possible for Jupyter HTML exported notebooks - https://phabricator.wikimedia.org/T357064 (10Gehel) [13:36:45] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Migrate Search Platform-owned hosts to Puppet 7 - https://phabricator.wikimedia.org/T354959 (10Gehel) p:05Triage→03Medium [13:36:52] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: [superset k8s] Add entries to the puppet service catalog - https://phabricator.wikimedia.org/T356483 (10Gehel) p:05Triage→03High [13:37:02] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: [superset k8s] Update public domain DNS records to make them point to the DSE Kubernetes ingress - https://phabricator.wikimedia.org/T356482 (10Gehel) p:05Triage→03High [13:37:04] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (10Gehel) p:05Triage→03High [13:37:12] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481 (10Gehel) p:05Triage→03High [13:37:17] 
10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Observability-Alerting: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795 (10Gehel) p:05Triage→03Medium [13:37:19] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Draft a kafka upgrade plan for all the WMF clusters - https://phabricator.wikimedia.org/T355550 (10Gehel) p:05Triage→03Medium [13:37:25] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (10Gehel) p:05Triage→03Medium [13:37:27] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10DC-Ops, 10SRE, 10ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10Gehel) p:05Triage→03High [13:47:03] !log deploying superset/superset-next services in dse-k8s-eqiad - T347710 [13:47:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:47:06] T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 [13:57:37] !og the production superset deployment failed due to a wrong MySQL password - T347710 [13:57:39] T347710: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 [14:01:12] !log superset was successfully deployed once the MySQL password was updated - T347710 [14:01:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:11:07] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481 (10brouberol) ` brouberol@dns1004:~$ dig +short superset-next.svc.eqiad.wmnet superset.svc.eqiad.wmnet k8s-ingress-dse.svc.eqiad.wmnet. 10... 
[14:15:19] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481 (10brouberol) 05Open→03Resolved [14:15:21] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol) [14:16:20] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03), 10Patch-For-Review: Configure ingress internal DNS records - https://phabricator.wikimedia.org/T356481 (10brouberol) ` brouberol@cumin1002:~$ curl -v https://superset-next.svc.eqiad.wmnet:30443/health * Uses proxy env variable no_proxy == 'w... [14:50:56] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926 (10mpopov) Just tried the query in the description with some recent dates but the d... 
[14:53:47] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol) [14:54:14] 10Data-Engineering, 10Data-Platform-SRE, 10Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (10brouberol) [15:01:38] 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (10BTullis) 05Open→03Resolved [15:08:31] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: Make wikistats pages, sections and individual infoboxes transcludable - https://phabricator.wikimedia.org/T351053 (10lbowmaker) [15:09:33] 10Data-Engineering, 10Data Pipelines, 10Patch-For-Review, 10Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (10lbowmaker) 05Open→03Resolved [15:11:10] 10Data-Engineering, 10Data Pipelines: webrequest / webrequest raw quality check - https://phabricator.wikimedia.org/T334678 (10lbowmaker) 05Open→03Declined Data quality checks for webrequests have been implemented in Q1 2024 using DeeQu. https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality [15:13:25] 10Data-Engineering, 10Data Pipelines, 10Documentation: Document the new Airflow backend: PostgreSQL - https://phabricator.wikimedia.org/T325138 (10lbowmaker) 05Open→03Resolved a:03lbowmaker https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Create_the_Airflow_PostgreSQL_Database [15:15:11] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (10lbowmaker) 05Open→03Resolved Resolving ticket as we have moved on with Airflow versions. 
[15:16:03] 10Data-Engineering, 10Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (10lbowmaker) 05Open→03Resolved a:03lbowmaker Resolving, job has been migrated to Airflow now. [15:16:34] 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, 10serviceops, and 3 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10akosiaris) >>! In T355685#9491036, @akosiaris wrote: >>>! In T355685#9491033, @Lucas_Werkmeister_WMDE wrote: >>>>! In T355685#9490969, @akosi... [15:17:13] 10Data-Engineering, 10Data Pipelines, 10Data Products, 10Privacy Engineering: Add cswiki to clickstream - https://phabricator.wikimedia.org/T339805 (10lbowmaker) [15:21:31] 10Data-Engineering, 10Data Pipelines: [Airflow] Gather dataset information from DataHub - https://phabricator.wikimedia.org/T327816 (10lbowmaker) 05Open→03Declined Declining as we are now working on the data config store: https://phabricator.wikimedia.org/T354557 [15:23:29] 10Data-Engineering, 10Data Pipelines: Review iceberg settings and document choices - https://phabricator.wikimedia.org/T312151 (10lbowmaker) Resolving as we have moved forward with Iceberg: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Iceberg [15:24:13] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products: Missing contributor stats for Singapore - https://phabricator.wikimedia.org/T344624 (10lbowmaker) [15:27:25] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (10bking) Just checked again...the reindexing process is still going. The script is running in a tmux window under my user on `mwmaint2002`. [15:27:39] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? 
- https://phabricator.wikimedia.org/T346231 (10lbowmaker) 05Open→03Resolved a:03lbowmaker Resolving ticket as this work is complete. https://wikitech.wi... [15:27:53] 10Data-Engineering, 10Data Pipelines: Review iceberg settings and document choices - https://phabricator.wikimedia.org/T312151 (10lbowmaker) 05Open→03Resolved a:03lbowmaker [15:28:30] 10Data-Engineering, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466 (10BTullis) a:05BTullis→03None Thanks for the update @mforns. Keep us posted as the plans develop. I'll... [15:29:34] 10Data-Engineering, 10Data Products: Mediarequests top articles: should use a disallow filter just like top articles - https://phabricator.wikimedia.org/T343793 (10lbowmaker) [15:31:40] 10Data-Engineering, 10Event-Platform (Sprint 14 B): mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (10lbowmaker) 05Open→03Resolved [15:33:10] 10Data-Platform-SRE: Improve Elastic operation macros/tmux - https://phabricator.wikimedia.org/T357142 (10bking) [15:34:36] 10Data-Engineering, 10Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (10lbowmaker) The new version of Superset allows nested data types to be visualized and we have spoken with users about using Hive cli to update their datasets instead of using Hue. Conversa... 
[15:45:30] 10Data-Platform-SRE: Monitor Elastic S3 repository status - https://phabricator.wikimedia.org/T357146 (10bking) [15:45:46] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Fix Elastic S3 repository status - https://phabricator.wikimedia.org/T357018 (10bking) [15:54:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:55:27] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (10BTullis) There is one more patch required for these servers, which is related to the `rsyncd-published service` that sync data to an-web1001. ` alertname = SystemdUnitFailed instance = stat1... [15:57:57] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (10BTullis) I have silenced the check for 14 days. Same for stat1010. [15:58:00] 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Fix Elastic S3 repository status - https://phabricator.wikimedia.org/T357018 (10bking) 05Open→03Resolved I've moved the monitoring part of this ticket to T357146 as it is less urgent than fixing the immediate problem. Closing... [16:24:45] 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941 (10bking) 05In progress→03Resolved I was using the wrong arguments to the script...reran the script [[ https://phabricator.wikimedia.org/P56589 | with the right arguments... 
[16:24:49] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (bking)
[16:27:04] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...' - https://phabricator.wikimedia.org/T320926 (mpopov) `lang=sql SELECT user_agent from wmf.webrequest WHERE webrequest_sour...
[17:01:55] Data-Engineering, Data Pipelines: Implement Image Recommendations DAG Performance Metrics - https://phabricator.wikimedia.org/T294480 (lbowmaker) Open→Resolved
[17:04:28] Data-Engineering, Data Pipelines, Spike: [SPIKE] Webrequest migration - https://phabricator.wikimedia.org/T324488 (lbowmaker) Open→Declined
[17:07:34] Data-Engineering, Data Pipelines: Airflow does not send SLA emails nor update SLA misses in the db - https://phabricator.wikimedia.org/T314181 (lbowmaker) Open→Resolved
[17:09:03] Data-Engineering, Data Pipelines: Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (lbowmaker) Open→Resolved Resolving, we have moved forward with dumps 2.0
[17:09:07] Data-Engineering, Data Pipelines, Patch-For-Review: Prototype Spark Streaming Job for Content Dumps - https://phabricator.wikimedia.org/T322326 (lbowmaker)
[17:12:02] Data-Engineering, Data Products, Data-Platform, MediaWiki-extensions-EventLogging, and 2 others: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (lbowmaker)
[17:55:02] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye
[18:11:27] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (VRiley-WMF) a:bking→VRiley-WMF
[18:15:26] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (VRiley-WMF) Worked with @bking on this. Verified it was okay to power down. Reseated the cable for the backplane and gave it a very stern...
[18:15:37] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), DC-Ops, SRE, ops-eqiad: Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (VRiley-WMF) Open→Resolved
[18:15:42] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 (VRiley-WMF)
[18:25:17] Data-Engineering, Data-Engineering-Jupyter, Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Security: Use custom CDN if possible for Jupyter HTML exported notebooks - https://phabricator.wikimedia.org/T357064 (dr0ptp4kt)
[18:36:12] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye completed: - cloudelastic1008 (**PA...
[18:46:10] Data-Engineering, Data Pipelines, Product-Analytics: Add Product-Analytics Announcements to Airflow job for notifications - https://phabricator.wikimedia.org/T301281 (Mayakp.wiki) @mpopov , curious to know if some of us in Movement Insights can be added to this mailing list to get the alerts instead...
[19:39:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:22:49] Data-Engineering, Data Pipelines: Convert to pure Docker the gitlab CI pipeline to build debianized conda - https://phabricator.wikimedia.org/T315475 (nshahquinn-wmf)
[21:24:46] Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - conflicting dependencies between r-base and other - https://phabricator.wikimedia.org/T343823 (nshahquinn-wmf)
[21:24:48] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Movement-Insights: Package versions in Conda-Analytics are not pinned - https://phabricator.wikimedia.org/T356231 (nshahquinn-wmf)
[21:52:54] Data-Engineering, Data-Platform-SRE (2024.01.22 - 2024.02.11), superset.wikimedia.org: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (cchen) @BTullis just tried the test case again, and it works for me now. Thank you!
[22:39:35] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata)
[22:42:15] Data-Engineering, Data Products, Observability-Logging, Traffic, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Ottomata) > `"dt": "2024-02-07T12:04:41.512972244Z",` I //think// that this more precise timestamp would be parseable by...
[23:47:38] Data-Engineering, EventStreams, Prod-Kubernetes, serviceops, and 2 others: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Ottomata) > wondering about the stream connection duration IIRC, varnish(?) sets a http timeout of something like...
[23:55:49] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey)