[05:43:37] <icinga-wm>	 PROBLEM - Host an-worker1108 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:42] <wikibugs>	 Data-Engineering, SRE, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (Clement_Goubert) At your service o>
[09:10:58] <btullis>	 !log reboot an-worker1108 as it was spinning with soft CPU lockups
[09:10:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:16:16] <wikibugs>	 Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:KCVelaga_WMF→JAnstee_WMF All the numbers align for grants now, final comparisons at https://docs.google.com/spreadsheets/d/1smlxmLZN3igND0vW1Zhsr5BRnXgWxx_zbrd5rxMhkqc/edit#gid...
[09:19:43] <icinga-wm>	 RECOVERY - Host an-worker1108 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[11:31:28] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Airflow configuration file in puppet to be compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (BTullis)
[11:31:46] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Verify DAG Compatibility with version V2.3.4 of Airflow - https://phabricator.wikimedia.org/T309552 (BTullis)
[11:33:58] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Verify DAG Compatibility with version V2.3.4 of Airflow - https://phabricator.wikimedia.org/T309552 (BTullis) I have updated the title and description to add clarity to the specific purpose of this ticket.  Also please see the following note from...
[11:35:39] <wikibugs>	 Data-Engineering-Planning, Data Pipelines (Sprint 05-06): Implement periodical cleaning of Airflow databases - https://phabricator.wikimedia.org/T322036 (BTullis) We believe that we will be able to use the new `airflow db clean` feature present in Airflow 2.3, once we have completed the upgrade to versio...
[11:55:42] <wikibugs>	 (PS1) Btullis: Upgrade superset to version 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[12:15:51] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[12:26:23] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) I'm testing this version on an-tool1005 by following the process outlined here: https://wikitech.w...
[12:36:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:39:09] <wikibugs>	 (CR) Aqu: [C: +2] Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[12:41:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:46:37] <wikibugs>	 (Merged) jenkins-bot: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[12:49:36] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Unfortunately, it didn't work. I received the following error after activating the `venv` and atte...
[12:52:06] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) This is done, all servers successfully joined the cluster. {F35844766}
[12:52:36] <wikibugs>	 (PS2) Btullis: Upgrade superset to version 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[12:53:59] <wmf-insecte>	 Starting build #115 for job analytics-refinery-maven-release-docker
[13:05:50] <wmf-insecte>	 Project analytics-refinery-maven-release-docker build #115: SUCCESS in 11 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/115/
[13:22:02] <wikibugs>	 (CR) Joal: "Missing the comment about ownership and permissions to be set to the folder - otherwise good for me." [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[13:23:49] <wikibugs>	 (CR) Joal: [C: +1] "LGTM :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[13:30:57] <wmf-insecte>	 Starting build #74 for job analytics-refinery-update-jars-docker
[13:31:13] <wikibugs>	 (PS1) Maven-release-user: Add refinery-source jars for v0.2.10 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/865626
[13:31:13] <wmf-insecte>	 Project analytics-refinery-update-jars-docker build #74: SUCCESS in 16 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/74/
[13:38:15] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (JArguello-WMF) a:Stevemunene
[13:39:37] <wikibugs>	 Data-Engineering, Epic, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (JArguello-WMF)
[13:46:46] <wikibugs>	 (PS7) Aqu: Declare the HDFS usage dataset in hive metastore [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169)
[13:48:15] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis)
[13:49:14] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Open→In progress p:Triage→High
[13:49:16] <wikibugs>	 Data-Engineering, Epic, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[14:03:15] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6b668b5c-bee2-4afe-b7ce-8ce1e95a7866) set...
[14:06:01] <btullis>	 !log rebuilding an-tool1005 as bullseye to test superset 1.5.2 upgrade
[14:06:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:07:43] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Starting the rebuild process based on this: https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/...
[14:09:27] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Changed the boot order back to local disk from another shell on the ganeti master. ` btullis@ganet...
[14:23:59] <wikibugs>	 (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[14:30:25] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Destroyed the existing puppet certificate. ` btullis@puppetmaster1001:~$ sudo puppet cert destroy...
[14:30:42] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Stevemunene) Checking the cert status on one of the cp hosts.  ` stevemunene@cp1077:~$ cat /etc/varnish...
[14:31:11] <wikibugs>	 (CR) Aqu: [C: +2] "Thanks @joal for the review." [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[14:32:52] <wikibugs>	 (CR) Aqu: [C: +2] "Thanks @joal & @mforns for the reviews." [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[14:40:21] <wikibugs>	 (CR) Aqu: [V: +2 C: +2] "Adding refinery/source 0.2.10 to refinery." [analytics/refinery] - https://gerrit.wikimedia.org/r/865626 (owner: Maven-release-user)
[14:49:58] <wikibugs>	 (CR) Aqu: [V: +2 C: +2] Add script for HDFS XML fsimage to bin folder [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167) (owner: Aqu)
[14:50:09] <wikibugs>	 (CR) Aqu: [V: +2 C: +2] Declare the HDFS usage dataset in hive metastore [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169) (owner: Aqu)
[15:11:08] <wikibugs>	 Data-Engineering, Cassandra, Epic, Platform Team Workboards (Platform Engineering Reliability): Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis) Open→Resolved Ding ding! Resolved an epic.
[15:13:46] <btullis>	 !log roll-restarting AQS to pick up new mediawiki_history_reduce snapshot
[15:13:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:27:46] <wikibugs>	 Data-Engineering, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (ntsako)
[15:39:12] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Ottomata) @BTullis @elukey should we be doing this with the new PKI, rather than cergen? {T296064}
[15:43:55] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (elukey) @Ottomata not sure how much time we have, in theory all Jumbo brokers will need to be able to a...
[15:45:09] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (Ottomata) Okay, let's just regen the new certs using cergen for now then.
[15:47:42] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) Yeah, I agree with @elukey. It's definitely a good case, but we only have until next Tuesday b...
[15:52:35] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) @Stevemunene here's my draft action plan for how I would go about this upgrade. I would do som...
[15:57:11] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (BTullis) My suspicion is that the varnishkafka instance will automatically restart when the certificate...
[16:24:52] <aqu>	 !log Deploying analytics/refinery (HDFS usage scripts)
[16:24:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:46:40] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (gmodena)
[16:46:55] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (gmodena)
[16:48:38] <wikibugs>	 (PS3) Btullis: Upgrade superset to version 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[16:52:02] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (Ottomata)
[17:03:03] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:01] <wikibugs>	 Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF)
[17:05:49] <wikibugs>	 Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF) a:JAllemandou
[17:06:43] <wikibugs>	 Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF) p:Low→High
[17:07:03] <wikibugs>	 Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JArguello-WMF)
[17:26:20] <wikibugs>	 Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > It'd be nice if the value of watchNamespaces didn't have to be hardcoded when the flink-operator is deployed Oh, [[ htt...
[17:35:42] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (lbowmaker)
[17:47:38] <aqu>	 !log Adding hdfs/usage folder dataset in HDFS
[17:47:39] <aqu>	 sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mkdir -p /wmf/data/hdfs/usage
[17:47:39] <aqu>	 sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-admins /wmf/data/hdfs
[17:47:39] <aqu>	 sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R 750 /wmf/data/hdfs
[17:47:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:57:40] <aqu>	 !log Adding raw hdfs fsimage dir in HDFS (an-launcher1002)
[17:57:40] <aqu>	 sudo -u analytics kerberos-run-command analytics hdfs dfs -mkdir -p /wmf/data/raw/hdfs_xml_fsimage
[17:57:40] <aqu>	 sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-admins /wmf/data/raw/hdfs_xml_fsimage
[17:57:40] <aqu>	 sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R 750 /wmf/data/raw/hdfs_xml_fsimage
[17:57:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:10:05] <wikibugs>	 Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Well, this is getting better, but I've still got a couple of issues.  1: envoy isn't running on an...
[18:19:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:34] <wikibugs>	 Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:41:30] <wikibugs>	 Data-Engineering-Planning, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (Varnent)
[19:37:04] <wikibugs>	 Data-Engineering, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (lbowmaker)
[20:05:33] <wikibugs>	 Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > `kubernetes.operator.dynamic.namespaces.enabled` Ah, but the upstream helm chart does not work with this feature becaus...
[20:06:42] <wikibugs>	 Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > Perhaps, we could wildcard the namespaces that the flink-operator is allowed to modify? E.g. namespace that starts with...
[20:18:38] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:12] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:02:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages