[00:01:12] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) >>! In T336357#9007132, @odimitrijevic wrote: > @BTullis do th...
[03:30:55] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10Peachey88) >>! In T336357#8844321, @BTullis wrote: > The Bishopfox wiki...
[04:12:20] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 0), 10Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (10tchin) On the wiki for schema guidelines there's a blanket statement tha...
[06:44:22] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10MoritzMuehlenhoff) >>! In T336357#9007388, @BTullis wrote: >>>! In T336...
[06:47:02] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10MoritzMuehlenhoff) Maybe a very quick fix with immediate impact is to simply move away from using the local hostname in the From: (Currently root@krb1001....
[08:16:15] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE)
[08:24:17] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) The point value for `floatfmt=".1f"` would also be calculated to make sure that we match it to the maximum decimal place needed :)
[08:55:17] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) >>! In T336357#9007689, @MoritzMuehlenhoff wrote: >>>! In T336...
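For context on the T340765 message at 04:12:20 above: a minimal sketch of the kind of check being requested, namely that a test should fail when a field present in the previous schema version is missing from a new, non-major version. This is plain Python for illustration only, not jsonschema-tools' actual implementation; the schema dicts and the simple version-string comparison are assumptions.

```python
# Illustrative sketch of the T340765 proposal; not jsonschema-tools code.
def removed_fields(old_schema: dict, new_schema: dict) -> set:
    """Return property names present in the old schema but missing from the new one."""
    old_props = set(old_schema.get("properties", {}))
    new_props = set(new_schema.get("properties", {}))
    return old_props - new_props


def assert_no_field_removal(old_schema: dict, new_schema: dict,
                            old_version: str, new_version: str) -> None:
    """Fail when a non-major version bump removes fields (assumed versioning scheme)."""
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    removed = removed_fields(old_schema, new_schema)
    if removed and new_major == old_major:
        raise AssertionError(
            f"Fields {sorted(removed)} were removed in {new_version}, "
            f"which is not a major bump from {old_version}"
        )
```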
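Likewise, for the T318155 message at 06:47:02: a sketch of the kind of change suggested there, i.e. sending the notification mail from a stable, non-host-specific From: address rather than root@<local hostname>. The addresses and SMTP host below are placeholders and this is not the actual manage_principals.py code.

```python
# Illustrative only: stable From: header instead of root@<local hostname> (T318155).
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "noreply@wikimedia.org"   # placeholder; the real address would differ
msg["To"] = "new.user@example.org"      # placeholder recipient
msg["Subject"] = "Your Kerberos principal has been created"
msg.set_content("Details about the new principal go here.")

with smtplib.SMTP("localhost") as smtp:  # assumes a local SMTP relay
    smtp.send_message(msg)
```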
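And for the T341589 message at 08:24:17: one way the dynamic precision mentioned there could be derived before being passed to tabulate as `floatfmt`. The helper name and sample DataFrame are assumptions for illustration, not wmfdata-python's actual df_to_remarkup code.

```python
# Illustrative sketch: derive a floatfmt precision matching the widest decimal
# place actually used in the DataFrame (T341589). Not wmfdata-python code.
import pandas as pd

def max_decimal_places(df: pd.DataFrame) -> int:
    """Return the largest number of decimal places used by any float value."""
    places = 0
    for col in df.select_dtypes(include="float"):
        for value in df[col].dropna():
            text = f"{value:g}"           # trims trailing zeros, e.g. 2.50 -> 2.5
            if "." in text:
                places = max(places, len(text.split(".")[1]))
    return places

df = pd.DataFrame({"views": [1.5, 2.25], "share": [0.125, 0.5]})
floatfmt = f".{max_decimal_places(df)}f"  # -> ".3f", matching 0.125
```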
[09:34:10] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/454 Update the schema registry used for airflow lineage in test
[09:47:55] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:48:08] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:48:25] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:48:43] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:48:53] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:07] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:20] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:30] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:40] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:53] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:49:57] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[09:50:16] 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Stevemunene) a:05Stevemunene→03Jclark-ctr
[10:00:27] 10Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10Stevemunene)
[10:26:59] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) The alert is gone and `check_private_data.py -S /run/mysqld/mysqld.s5.sock` is not outputting anything anymore. I think it's good to go now...
[11:42:58] 10Data-Engineering, 10AQS2.0, 10PageViewInfo, 10API Platform (AQS 2.0 Roadmap): MediaWiki frequently receives HTTP 500 from AQS (via PageViewInfo extension) - https://phabricator.wikimedia.org/T341634 (10BTullis) This is interesting and I'd like to help track down the source of the errors if I can. However...
[11:48:15] (03Abandoned) 10Hnowlan: Add docker-compose environment with cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/679295 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan)
[12:26:26] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) @Ladsgroup The wikireplicas don't have the triggers. I don't know if they should or not in general/long term, but given that all other dbs hav...
[12:39:43] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) Thanks. Done now. I'll update docs to reflect that.
[12:45:13] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) And again, stressing that it may be not needed, but that should be handled on a separate ticket. With that done, I think @BTullis can proceed.
[13:17:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm do we have any update on this?
[13:51:51] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr)
[13:52:11] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (10Jclark-ctr)
[13:52:38] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (10Jclark-ctr)
[13:53:00] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr)
[13:53:18] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Jclark-ctr)
[13:53:38] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr)
[13:53:58] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Jclark-ctr)
[13:54:10] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Jclark-ctr)
[13:54:30] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr)
[13:54:52] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr)
[13:55:10] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Jclark-ctr)
[13:55:28] 10Data-Platform-SRE, 10decommission-hardware, 10ops-eqiad: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Jclark-ctr)
[14:11:59] !log roll-restarting zookeeper on druid-public for new JVM version
[14:12:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:24:57] 10Data-Platform-SRE: Migrate analytics - https://phabricator.wikimedia.org/T341700 (10Stevemunene)
[14:38:45] 10Data-Platform-SRE: Migrate analytics_test airflow instance to bullseye an-test-client1002 - https://phabricator.wikimedia.org/T341700 (10Stevemunene) a:05BTullis→03Stevemunene
[15:14:27] 10Data-Platform-SRE, 10Discovery-Search, 10SRE: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking)
[15:16:24] 10Data-Platform-SRE, 10Discovery-Search, 10SRE, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10jbond)
[15:17:43] 10Data-Platform-SRE, 10Discovery-Search, 10SRE, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10jbond) approved
[15:18:26] 10Data-Platform-SRE, 10Discovery-Search, 10SRE, 10vm-requests: eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking)
[15:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[15:36:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[15:41:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[15:41:27] The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[15:45:13] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[15:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:47:57] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: (2) The mw-page-content-change-enrich Flink cluster in codfw has no registered TaskManagers - TODO - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[16:07:51] ^ aqu: [ops week] This flink notification is a new alert, just introduced by gmodena. We can ignore it for now: https://wikimedia.slack.com/archives/CSV483812/p1689171702286509
[16:16:05] OK thx
[16:26:17] !log `sudo cumin A:wikireplicas-all 'maintain-views --replace-all --all-databases --table revision'` for T339037
[16:26:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:31:05] 10Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (10BTullis) p:05Triage→03Medium
[18:51:04] 10Data-Engineering, 10Cassandra: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (10fkaelin) Can this task be closed as done?
[18:55:00] 10Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (10Isaac) @BTullis a few more (I was able to create a chart that some people now want to see)! * @Maryana: https://wikitech.wikimedia.org/wiki/User:Maryana * @NHillard-WMF: https://wikitec...
[19:44:40] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10Novem_Linguae)
[19:46:09] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10Novem_Linguae)
[19:47:14] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10Novem_Linguae)
[19:47:43] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10rook) This seems to work for me in Superset, does it work for you https://superset.wmcloud.org/superset/sqllab/ ?
[19:53:57] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10Rhododendrites) Tried in Superset, but I don't see a way to use it without setting a limit (the results will be much more than the maximum limit).
[19:55:39] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10rook) The limit could probably be increased. Could you describe the value in such sizable result sets?
[20:09:17] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10Rhododendrites) Just some exploratory research. It's possible someone with better SQL chops than me could figure out everything they need through a bunch of queries, but I tend to just load things in Excel (or to...
[20:10:45] 10Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722 (10rook) >>! In T341722#9010264, @Rhododendrites wrote: > Just some exploratory research. It's possible someone with better SQL chops > than me could figure out everything they need through a bunch of queries, > but...
[20:23:31] 10Data-Engineering, 10Cassandra: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (10Eevans) >>! In T340494#9010004, @fkaelin wrote: > Can this task be closed as done? Ideally, part of provisioning a new dataset would be to work out capacity planning. We don't have muc...
[20:23:40] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking)
[20:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:34:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:25] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search, 10serviceops-radar, 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10bking)
[20:46:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:58:35] 10Data-Engineering, 10Cassandra: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (10fkaelin) This is a fairly small dataset - as of June 2023 about 12mb added per month, with about 2GB of data in total so far (as parquet files on hdfs). We do plan on adding additional k...