[05:43:37] PROBLEM - Host an-worker1108 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:42] Data-Engineering, SRE, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (Clement_Goubert) At your service o>
[09:10:58] !log reboot an-worker1108 as it was spinning with soft CPU lockups
[09:10:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:16:16] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:KCVelaga_WMF→JAnstee_WMF All the numbers align now for grants now, final comparisons at https://docs.google.com/spreadsheets/d/1smlxmLZN3igND0vW1Zhsr5BRnXgWxx_zbrd5rxMhkqc/edit#gid...
[09:19:43] RECOVERY - Host an-worker1108 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[11:31:28] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Airflow configuration file in puppet to be compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (BTullis)
[11:31:46] Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Verify DAG Compatibility with version V2.3.4 of Airflow - https://phabricator.wikimedia.org/T309552 (BTullis)
[11:33:58] Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Verify DAG Compatibility with version V2.3.4 of Airflow - https://phabricator.wikimedia.org/T309552 (BTullis) I have updated the title and description to add clarity to the specific purpose of this ticket. Also please see the following note from...
[11:35:39] Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Implement periodical cleaning of Airflow databases - https://phabricator.wikimedia.org/T322036 (BTullis) We believe that we will be able to use the new `airflow db clean` feature present in Airflow 2.3, once we have completed the upgrade to versio...
[11:55:42] (PS1) Btullis: Upgrade superset to verstion 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[12:15:51] Data-Engineering, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[12:26:23] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) I'm testing this version on an-tool1005 by following the process outlined here: https://wikitech.w...
[12:36:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:39:09] (CR) Aqu: [C: +2] Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[12:41:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:46:37] (Merged) jenkins-bot: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[12:49:36] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Unfortunately, it didn't work. I received the following error after activating the `venv` and atte...
[12:52:06] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) This is done, all servers successfully joined the cluster. {F35844766}
[12:52:36] (PS2) Btullis: Upgrade superset to verstion 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[12:53:59] Starting build #115 for job analytics-refinery-maven-release-docker
[13:05:50] Project analytics-refinery-maven-release-docker build #115: SUCCESS in 11 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/115/
[13:22:02] (CR) Joal: "Missing the comment about ownership and permissions to be set to the folder - otherwise good for me." [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[13:23:49] (CR) Joal: [C: +1] "LGTM :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[13:30:57] Starting build #74 for job analytics-refinery-update-jars-docker
[13:31:13] (PS1) Maven-release-user: Add refinery-source jars for v0.2.10 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/865626
[13:31:13] Project analytics-refinery-update-jars-docker build #74: SUCCESS in 16 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/74/
[13:38:15] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (JArguello-WMF) a:Stevemunene
[13:39:37] Data-Engineering, Epic, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (JArguello-WMF)
[13:46:46] (PS7) Aqu: Declare the HDFS usage dataset in hive metastore [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169)
[13:48:15] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis)
[13:49:14] Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Open→In progress p:Triage→High
[13:49:16] Data-Engineering, Epic, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[14:03:15] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6b668b5c-bee2-4afe-b7ce-8ce1e95a7866) set...
[14:06:01] !log rebuilding an-tool1005 as bullseye to test superset 1.5.2 upgrade
[14:06:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:07:43] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Starting the rebuild process based on this: https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/...
[14:09:27] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Changed the boot order back to local disk from another shell on the ganeti master. ` btullis@ganet...
[14:23:59] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[14:30:25] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Destroyed the existing puppet certificate. ` btullis@puppetmaster1001:~$ sudo puppet cert destroy...
[14:30:42] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Stevemunene) Checking the cert status on one of the cp hosts. ` stevemunene@cp1077:~$ cat /etc/varnish...
[14:31:11] (CR) Aqu: [C: +2] "Thanks @joal for the review." [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[14:32:52] (CR) Aqu: [C: +2] "Thanks @joal & @mforns for the reviews." [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[14:40:21] (CR) Aqu: [V: +2 C: +2] "Adding refinery/source 0.2.10 to refinery." [analytics/refinery] - https://gerrit.wikimedia.org/r/865626 (owner: Maven-release-user)
[14:49:58] (CR) Aqu: [V: +2 C: +2] Add script for HDFS XML fsimage to bin folder [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[14:50:09] (CR) Aqu: [V: +2 C: +2] Declare the HDFS usage dataset in hive metastore [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[15:11:08] Data-Engineering, Cassandra, Epic, Platform Team Workboards (Platform Engineering Reliability): Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis) Open→Resolved Ding ding! Resolved an epic.
[15:13:46] !log roll-restarting AQS to pick up new mediawiki_history_reduce snapshot
[15:13:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:27:46] Data-Engineering, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (ntsako)
[15:39:12] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Ottomata) @BTullis @elukey should we be doing this with the new PKI, rather than cergen? {T296064}
[15:43:55] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (elukey) @Ottomata not sure how much time we have, in theory all Jumbo brokers will need to be able to a...
[15:45:09] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Ottomata) Okay, let's just regen the new certs using cergen for now then.
[15:47:42] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) Yeah, I agree with @elukey. It's definitely a good case, but we only have until next Tuesday b...
[15:52:35] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) @Stevemunene here's my draft action plan for how I would go about this upgrade. I would do som...
[15:57:11] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) My suspicion is that the varnishkafka instance will automatically restart when the certificate...
[16:24:52] !log Deploying analytics/refinery (HDFS usage scripts)
[16:24:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:46:40] Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event eriven Python services - https://phabricator.wikimedia.org/T324689 (gmodena)
[16:46:55] Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (gmodena)
[16:48:38] (PS3) Btullis: Upgrade superset to verstion 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[16:52:02] Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (Ottomata)
[17:03:03] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:01] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF)
[17:05:49] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF) a:JAllemandou
[17:06:43] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF) p:Low→High
[17:07:03] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF)
[17:26:20] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > It'd be nice if the value of watchNamespaces didn't have to be hardcoded when the flink-operator is deployed Oh, [[ htt...
[17:35:42] Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (lbowmaker)
[17:47:38] !log Adding hdfs/usage folder dataset in HDFS
[17:47:39] sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mkdir -p /wmf/data/hdfs/usage
[17:47:39] sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-admins /wmf/data/hdfs
[17:47:39] sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R 750 /wmf/data/hdfs
[17:47:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:57:40] !log Adding raw hdfs fsimage dir in HDFS (an-launcher1002)
[17:57:40] sudo -u analytics kerberos-run-command analytics hdfs dfs -mkdir -p /wmf/data/raw/hdfs_xml_fsimage
[17:57:40] sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-admins /wmf/data/raw/hdfs_xml_fsimage
[17:57:40] sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R 750 /wmf/data/raw/hdfs_xml_fsimage
[17:57:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:10:05] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Well, this is getting better, but I've still got a couple of issues. 1: envoy isn't running on an...
[18:19:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:34] Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:41:30] Data-Engineering-Planning, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (Varnent)
[19:37:04] Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (lbowmaker)
[20:05:33] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > `kubernetes.operator.dynamic.namespaces.enabled` Ah, but the upstream helm chart does not work with this feature becaus...
[20:06:42] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > Perhaps, we could wildcard the namespaces that the flink-operator is allowed to modify? E.g. namespace that starts with...
[20:18:38] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:12] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:02:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages