[01:06:50] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata's presto.run fails with Urllib3 v2 or higher - https://phabricator.wikimedia.org/T345309 (nshahquinn-wmf)
[01:31:45] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata's presto.run fails with Urllib3 v2 or higher - https://phabricator.wikimedia.org/T345309 (nshahquinn-wmf)
[03:30:33] Data-Platform-SRE, Data-Services: Queries to externallinks table fail following schema changes - https://phabricator.wikimedia.org/T344866 (Count_Count) Can someone run `maintain-views` for s2? ` Could not get links for wikimap.toolforge.org: Database(Database(MySqlDatabaseError { code: Some("HY000"), n...
[05:06:39] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) >>! In T340648#9099641, @Stevemunene wrote: > From the [] Create WMDE airflow admin group review, the `aiflow-wmde-admins` group requires a system user in order to per...
[05:18:15] Data-Engineering: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (phuedx)
[05:29:25] Data-Engineering: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (phuedx) This can be implemented in our jsonschema-tools package or we could request a change upstream. AIUI the json-schema-merge-allof package allows for its behaviour to be changed...
[05:40:47] Data-Engineering: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (phuedx)
[07:32:49] Data-Platform-SRE, Data-Services: Queries to externallinks table fail following schema changes - https://phabricator.wikimedia.org/T344866 (Ladsgroup) Done
[07:50:06] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene)
[07:55:29] Data-Platform-SRE, CAS-SSO, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene)
[08:38:53] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (elukey) Hi folks! Yes I'd follow what we did for `analytics-product` etc.. since we'll create the same system user (uid/gid) across nodes (airflow, stat100x, hadoop worker nodes, et...
[09:05:39] Data-Engineering, Data-Persistence: clouddb1017/MariaDB memory is CRITICAL - https://phabricator.wikimedia.org/T345322 (aborrero)
[09:21:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:29] joal: I just created T345327 and added it to java-scala-standardization
[09:40:30] T345327: Create a Maven archetype so that we can easily create new Maven based projects - https://phabricator.wikimedia.org/T345327
[09:40:42] Awesome - thank you gehel
[09:41:23] joal: I already have a WIP patch about it: https://gerrit.wikimedia.org/r/c/wikimedia/discovery/discovery-parent-pom/+/934276
[09:41:30] wow - too fast
[09:51:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:57] stevemunene: If you have time for a review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/950136. I have one more test I need to do (generating the config and comparing it to the process currently running on the servers), but otherwise it should be good.
[10:10:07] ping me if you need more context!
[10:12:26] Data-Engineering, Data-Persistence: clouddb1017/MariaDB memory is CRITICAL - https://phabricator.wikimedia.org/T345322 (aborrero) Seen on IRC backlog: `lang=irc Amir1: are you around/available to help me find and kill a query that's about to crash clouddb1017? I thought I knew how but it...
[10:16:12] (PS6) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[10:23:36] Data-Engineering, Metrics Platform Backlog: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (phuedx)
[10:29:45] (CR) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[10:34:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:00] (CR) Phuedx: [C: +1] Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[10:54:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:06:27] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:35:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:29] Data-Engineering, Data-Persistence: clouddb1017/MariaDB memory is CRITICAL - https://phabricator.wikimedia.org/T345322 (Ladsgroup) Open→Resolved a:Ladsgroup
[11:52:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:44] !log About to deploy analytics refinery (weekly train)
[12:01:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:05:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:22:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:48] !log Deployed refinery using scap, then deployed onto hdfs
[13:02:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:20:22] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
[14:20:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:41] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye completed: - wdqs1010 (**WARN**) - Removed from Puppet and Pupp...
[15:50:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:05:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:18:52] (PS1) TChin: Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511)
[16:29:06] Data-Engineering, Metrics Platform Backlog: Make jsonschema-tools merge values of enums when merging allOf - https://phabricator.wikimedia.org/T345317 (Ottomata) > AIUI the json-schema-merge-allof package allows for its behaviour to be changed on a per-keyword > My preference would be to implement this...
[17:30:01] Data-Engineering, Data-Persistence: clouddb1017/MariaDB memory is CRITICAL - https://phabricator.wikimedia.org/T345322 (Andrew) Thank you @Ladsgroup
[17:56:35] (CR) Gmodena: "Are you already in touch with metrics platform folks on this? If possible I'd rather we fix the root cause of the error, rather than skipp" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: TChin)
[18:25:28] (CR) TChin: Skip schema-deterministic-types for metrics_event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: TChin)
[18:36:56] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (Gehel) Quick check, it looks that https://data.nlg.gr/query is the URL to the UI, but the SPARQL endpoint...
[18:55:03] (CR) Mforns: [C: +1] "LGTM!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[19:00:33] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (RKemper)
[19:01:31] Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - https://phabricator.wikimedia.org/T343823 (Gehel)
[19:27:50] Data-Platform-SRE, Patch-For-Review: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (RKemper) Generated new cergen certs for `wdqs.discovery.wmnet` that include `wdqs1016` in the `alt_names` instead of `wdqs1005`. Followed the steps below: ` INSTRUCTIONS (1) Edit /srv/private/modu...
[20:11:43] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `wdqs1005.eqiad.wmnet` - wdqs1005.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physical host...
[20:14:37] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (RKemper)
[20:14:52] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (RKemper)
[20:51:53] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk2001.codfw.wmnet` - flink-zk200...
[21:08:26] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk2003.codfw.wmnet` - flink-zk200...
[21:27:29] (CR) Clare Ming: [C: +1] "lgtm \o/" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: Phuedx)
[21:37:14] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo)
[21:40:33] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > I have created a patch to build docker images of Spark version 3.3.3 As of this writing, `pyspark=3.3.3` is [[ https://an...
[22:37:03] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (Epidosis) Thanks. I retried the previous one changing "query" into "sparql", but something is still mistak...