[01:43:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:48:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:13:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:18:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:38:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:39:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5020 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5020%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:43:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:44:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5020 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5020%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:55:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:09:16] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Tgr) >>! In T308017#8384646, @Ottomata wrote: > Re-opening to discuss a schema change. > > In https://gerrit.wikim...
[05:39:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:44:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:55:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[08:23:33] hm - we have an issue with a dataset - see https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&viewPanel=28&from=now-7d&to=now
[08:23:54] I'm gonna spend time investigating which dataset generated this
[08:24:01] aqu: good morning!
[08:24:33] aqu: I bet you have an hdfs_usage from before 2022-11-20?
[08:24:56] If so, it'd be awesome if you could also generate one from today, so that I can compare, please :)
[08:25:49] joal: Hello, last one from 2022-10-16. Will make one now.
[08:26:02] Thanks a million aqu :)
[08:30:05] (CR) Joal: "I think this should be reverted as with this jar version the job will fail: code is for spark3 with scala2.12, while backend engine is spa" [analytics/refinery] - https://gerrit.wikimedia.org/r/857775 (https://phabricator.wikimedia.org/T320860) (owner: Mforns)
[08:36:14] Data-Engineering-Planning, Cassandra, Data Pipelines (Sprint 04), Patch-For-Review: Write dedicated cassandra authorization code to read password from file when loading - https://phabricator.wikimedia.org/T306895 (JAllemandou) Thank you @BTullis and @Ottomata for taking over this - this will be v...
[09:07:24] Data-Engineering, API Platform (Sprint 01), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (codebug) @BPirkle what is the final decision on the encoding test as my current implementation of this uses MUX
[10:55:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[10:57:34] I've acknowledged the alert above. Looks like we had a bump in HDFS files by about 10 million in the past couple of days.
[10:57:38] https://usercontent.irccloud-cdn.com/file/oFw9g9SY/image.png
[10:58:23] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (jcrespo) To clarify- there is no blocker from SRE team ops to proceed with this, we are eager and waiting for the template to be added on this ticket to...
[10:58:26] btullis: indeed - I've asked aqu for a new version of the hdfs_usage so that I can try to understand the reason
[10:58:52] joal: Many thanks <3
[11:10:50] joal: (re number of files in hdfs) new jobs we've put in prod should be writing to hdfs:///wmf/data/discovery/cirrus/
[11:11:03] ack dcausse - thanks for letting me know :)
[12:42:48] (PS2) DCausse: [WIP] cirrussearch: add fetch_failure schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609)
[13:10:25] Data-Engineering-Planning, Data Pipelines: refinery scap deployment to thin nodes is broken - https://phabricator.wikimedia.org/T321506 (BTullis) I believe that this is fixed already, isn't it? We haven't seen this problem in the weekly train recently, have we?
[13:22:58] Data-Engineering-Planning, Data Pipelines: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (BTullis) +11
[13:45:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[13:50:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5018 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5018%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[13:58:05] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (BTullis) Thanks Jaime, Here are the existing sudo permissions applicable to `analytics-admins`: https://github.com/wikimedia/puppet/blob/production/modu...
[14:08:00] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (JArguello-WMF)
[14:12:37] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric)
[14:18:38] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) @ntsako the final QC, based on recalculated numbers is on [[https://docs.google.com/spreadsheets/d/1smlxmLZN3igND0vW1Zhsr5BRnXgWxx_zbrd5rxMhkqc/edit?pli=1#gid=75628010&range=M1:U3|this shee...
[14:20:06] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:JAnstee_WMF→ntsako @ntsako please assign it back to me once the changes are done.
[14:32:01] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (Ottomata) @jcrespo I can make this change once the other approvals have been given.
[14:41:15] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (jcrespo) Thanks Ottomata, please use [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ | the template with the checklist ]] I linked to y...
[14:46:16] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric) Open→Declined Deciding against Flink, at least for now. Documenting as a [[ https://docs.google.com/spreadsheets/d/1IfTe_eFaf4VE6metf...
[15:02:06] joal: ufa, I messed up MediawikiHistory then?
[15:07:31] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Jclark-ctr) cephosd1001 E1 U3 Port 3 Cableid# 20220225 cephosd1002 E2 U3 Port 3 Cableid# 20220237 cephosd1003 E3 U3 P...
[15:15:05] Data-Engineering-Radar, Cassandra: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (Eevans)
[15:23:26] Data-Engineering, Data Pipelines, Platform Engineering: Catalog, Categorize, and Templetize existing scheduled workflows - https://phabricator.wikimedia.org/T282035 (mforns)
[15:23:55] Hi mforns - I think next run will fail if we don't do something :)
[15:24:07] hm...
[15:27:24] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[15:34:00] Data-Engineering-Planning, Data Pipelines: refinery scap deployment to thin nodes is broken - https://phabricator.wikimedia.org/T321506 (EChetty) Open→Resolved
[15:50:12] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) We have been testing these changes on `an-test-client1001` for a while now. Here...
[15:51:49] heya aqu - any news of a new dataset instance for today?
[15:55:29] joal: I thought you just wanted a checkpoint, sorry. so I stopped at the xml step. I am producing the hive table now...
[15:55:41] Man thanks aqu :)
[15:55:45] +y
[15:59:56] PROBLEM - IPMI Sensor Status on an-worker1148 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:09:28] o/ hi folks!
[16:09:43] if you have time for https://gerrit.wikimedia.org/r/c/analytics/refinery/+/858561 before the next train I'd be super grateful :)
[16:21:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:37] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (Ottomata)
[17:16:52] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (Ottomata) Done, I removed irrelevant parts, if that is okay.
[17:30:06] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 04): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata) o/ I am working on Flink and flink operator images now: https://gerrit.wikimedia....
[17:31:35] Data-Engineering, API Platform (Sprint 01), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (BPirkle) Open→Stalled @codebug, let's go with the fasthttp router. That [[ https://gitlab.wikimedia.org/...
[17:31:37] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Pageviews Service - https://phabricator.wikimedia.org/T288296 (BPirkle)
[17:31:49] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Thanks @Tgr! At this point it is easy enough to remove, and we can always add it back in later if/when w...
[17:35:41] (PS1) Sergio Gimeno: Add user new impact data to the impact homepagemodule [schemas/event/secondary] - https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160)
[17:38:50] aqu: Heya - Do you have an ETA on the hdfs_usage data? It'd be awesome if it could be before tomorrow
[17:53:14] Data-Engineering, SRE, SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (jcrespo) >>! In T323280#8410218, @Ottomata wrote: > Done, I removed irrelevant parts, if that is okay. 👍 Sorry to be pedantic about this, it is not me...
[18:12:31] (CR) Mforns: [V: +2 C: +2] "Self-merging this, since it has already been deployed manually." [analytics/refinery] - https://gerrit.wikimedia.org/r/858625 (owner: Mforns)
[18:29:45] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Jclark-ctr) a:Jclark-ctr→Cmjohnson an-coord1003 E1 U36 Port 36 Cableid # 20220001 an...
[18:30:13] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Jclark-ctr)
[18:34:40] (CR) Ottomata: [C: +1] oozie: add cache_status to webrequest's druid indexations [analytics/refinery] - https://gerrit.wikimedia.org/r/858561 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[18:38:25] (CR) Ottomata: [C: +1] navigationtiming: Add skin field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857493 (https://phabricator.wikimedia.org/T323124) (owner: Phedenskog)
[18:39:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5030 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:44:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5030 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5030%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:45:47] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (nshahquinn-wmf) @xcollazo a month ago, I suggested changing the default source of Conda pac...
[19:35:16] (CR) Krinkle: [C: +2] navigationtiming: Add skin field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857493 (https://phabricator.wikimedia.org/T323124) (owner: Phedenskog)
[19:35:48] (Merged) jenkins-bot: navigationtiming: Add skin field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857493 (https://phabricator.wikimedia.org/T323124) (owner: Phedenskog)
[19:38:08] joal: The process crashed. I'm going to try again.
[19:48:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:47] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (mpopov) +1 to switching to conda-forge as the default source
[20:01:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:08] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) @nshahquinn-wmf and @mpopov : The way channels are setup right now is as follows...
[21:13:10] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (nshahquinn-wmf) >>! In T321088#8411016, @xcollazo wrote: > I could add the following for yo...
[23:01:17] (PS3) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344
[23:06:23] (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (owner: Milimetric)
[23:18:45] joal: I have the data and I think we have a winner.