[02:45:59] (PuppetFailure) firing: Puppet has failed on an-conf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:16:59] (PuppetFailure) firing: Puppet has failed on dumpsdata1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:42:00] (PuppetFailure) firing: (2) Puppet has failed on dumpsdata1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:46:14] (PuppetFailure) firing: Puppet has failed on an-conf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:33:16] (EventgateValidationErrors) firing: ... 
[07:33:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [07:55:18] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Implement a spark job that converts a RDF triples table into a RDF file format - https://phabricator.wikimedia.org/T350106 (10dcausse) [08:17:13] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10CodeReviewBot) stevemunene opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/532 switch druid host to index to the druid-public cluster and datahub i... [08:45:08] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10CodeReviewBot) stevemunene closed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/532 switch druid host to index to the druid-public cluster and datahub i... 
[08:47:14] thanks for the review jbond, I learned a lot [08:59:35] 10Data-Platform-SRE: Decom search-loader VMs still using Buster - https://phabricator.wikimedia.org/T350078 (10Gehel) [09:01:08] (PuppetFailure) resolved: Puppet has failed on an-conf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:01:59] (PuppetFailure) resolved: (2) Puppet has failed on dumpsdata1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:06:38] brouberol: no problem :) [09:21:04] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) [09:25:07] heads-up: I've replaced the self-signed/generated skein certificate on an-test1002.eqiad.wmnet by a certificate generated by our cfssl PKI. I'm hoping that the hourly `aqs_hourly` job will still be able to be executed on Spark (via Skein) without issue. If not, I'll revert [09:25:18] 10Data-Platform-SRE: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) a:03Stevemunene [09:26:06] !log I replaced the self-signed skein certificate by one issued by our cfssl PKI on an-test1002 - T329398 [09:26:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:26:10] T329398: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 [09:35:11] jbond: when cfssl generates a chained certificate, do we have to move the .chain.pem file into /etc/ca-certificates as well? [09:36:36] sorry let me rephrase: to validate the certificate, openssl needs to know about the intermediate CA certificate, which seems to be the .chain.pem file. Do we already have this intermediate CA cert in our chain of trust, deployed on the hosts, or we I ned to add it ? 
[09:36:46] *or do I need [09:39:24] brouberol: no, that's not needed [09:40:08] when you configure your service, e.g. apache, you need to use the chained version of the certificate so that apache will send the client the leaf and intermediate certificates [09:40:24] all hosts have the pki root certificate which allows them to validate that chain [09:41:00] perfect, thanks [10:01:59] (PuppetDisabled) firing: Puppet disabled on dbstore1007:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [10:06:49] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) I've played around with the cfssl-generated chained certificate, to see whether I could have Skein accept it as a valid x509 certificate. ` brouberol@an-test-client1002:~$ sudo mv /srv/airflow-a... [10:31:06] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10CodeReviewBot) stevemunene opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/533 switch druid host to index to the druid-public cluster and datahub i... [10:31:14] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) Actually, I realized that I had only changed the _certificate_ but not the private key.. ` # the original skein.crt had been restored at this point brouberol@an-test-client1002:/srv/airflow-anal... 
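Editor's sketch for the chained-certificate exchange above (09:35 to 09:41): a chained PEM file is a concatenation of certificate blocks, so a service configured with it can send both the leaf and the intermediate, while clients only need the root they already trust. The snippet below is a minimal illustration, not the tooling used on these hosts. The function names and the placeholder certificate bodies are hypothetical, and it assumes the leaf comes first in the file, which is the usual convention but is not confirmed in the log for the cfssl output.

```python
import re
from typing import Dict, List

# Matches one PEM-encoded certificate block, including its armor lines.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_chain(bundle: str) -> List[str]:
    """Return every certificate block found in a PEM bundle, in file order."""
    return PEM_BLOCK.findall(bundle)


def describe_chain(bundle: str) -> Dict[str, object]:
    """Summarize a bundle, ASSUMING the leaf is the first block in the file."""
    certs = split_pem_chain(bundle)
    return {
        "count": len(certs),
        "leaf": certs[0] if certs else None,
        "intermediates": certs[1:],
    }


# Placeholder bodies only: these are NOT real certificates.
FAKE_CHAIN = (
    "-----BEGIN CERTIFICATE-----\nTEVBRg==\n-----END CERTIFICATE-----\n"
    "-----BEGIN CERTIFICATE-----\nSU5URVJNRURJQVRF\n-----END CERTIFICATE-----\n"
)

summary = describe_chain(FAKE_CHAIN)
print(summary["count"], "certificate blocks;",
      len(summary["intermediates"]), "intermediate(s)")
```

This only counts and orders the blocks; it does not verify signatures or expiry. Actual validation against the root would be left to the TLS library or to `openssl verify`.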
[11:17:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:16] (EventgateValidationErrors) firing: ... [11:33:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [11:38:03] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) @Ottomata moving our convo to phab: The benefit of this phab/... 
[11:41:11] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) > I have an hypothesis, but I posted a question to user@flink.a... [12:53:07] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) @Ottomata ack > If you think it is worth the cost, then let'... [12:56:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10Ottomata) > my search-foo might have failed me It did not, we have not really used keys befo... [12:58:04] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10Ottomata) Okay, yes, sounds good! [13:32:25] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/da... 
[14:02:14] (PuppetDisabled) firing: Puppet disabled on dbstore1007:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:16:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [14:36:29] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:37:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:36] 10Data-Engineering: Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10Ottomata) [15:04:44] 10Data-Engineering, 10Data Engineering and Event Platform Team: Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10Ottomata) [15:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:16] (EventgateValidationErrors) firing: ... 
[15:33:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:47:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:15] (03PS1) 10Clare Ming: Add slo.wikimedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 [16:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:12] (03CR) 10Clare Ming: "not sure who decides whether something should be whitelisted - but here's a patch to add this site that's been sending DE alerts" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 (owner: 10Clare Ming) [16:06:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:15:54] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [16:17:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:39] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with... 
[16:32:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:14] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10odimitrijevic) [16:34:22] 10Data-Engineering, 10Data-Catalog, 10Documentation: Data Catalog Documentation Style Guide - https://phabricator.wikimedia.org/T310229 (10odimitrijevic) 05Open→03Resolved This was delivered as part of the "documentathon": https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Data_Catalog_... [16:53:28] Are there guidelines about how to share processing time? I have a job I'd like to run on stat1009 for a few hours, set to 16+ x vCPU if nobody minds. [16:59:14] 10Data-Engineering, 10Data Engineering and Event Platform Team: [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (10Ahoelzl) [17:14:07] (03CR) 10Michael Große: "This change is ready for review." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970415 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [17:14:18] (03CR) 10Michael Große: "This change is ready for review." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970416 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [17:14:24] If anyone needs to kill it, the process is beam.smp(483804) [17:14:30] (03CR) 10Michael Große: "This change is ready for review." 
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/970417 (https://phabricator.wikimedia.org/T348644) (owner: 10Michael Große) [17:51:40] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) Did a little investigating today. Got flame graphs for node10 and node18... [18:02:14] (PuppetDisabled) firing: Puppet disabled on dbstore1007:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [18:20:15] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) - https://github.com/nodejs/node/issues/42511 - https://github.com/nodejs... [18:31:35] 10Data-Engineering, 10ChangeProp, 10Data Engineering and Event Platform Team, 10observability, and 2 others: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180 (10Ottomata) [18:32:59] 10Data-Engineering, 10ChangeProp, 10Data Engineering and Event Platform Team, 10observability, and 2 others: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180 (10Ottomata) We can probably upgrade to prom-client 12.0.0 and get GC metric... 
[18:37:28] 10Data-Engineering, 10ChangeProp, 10Data Engineering and Event Platform Team, 10observability, and 2 others: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180 (10Ottomata) ^ doesn't look like it :( [18:49:55] 10Data-Platform-SRE, 10Patch-For-Review: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) {F41417628} I'm not sure I understand why, but for as soon as I deploy the PKI-generated certificate/private key, the `aqs_hourly` jobs start being rescheduled indefinitely.... [18:57:09] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [19:02:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:57] (03PS2) 10Clare Ming: Add slo.wikimedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 [19:04:04] (03PS3) 10Clare Ming: Add slo.wikimedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 [19:04:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:12] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [19:16:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:50] (SystemdUnitFailed) resolved: 
produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:45] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) https://quarry-test.wmcloud.org offers a running, but not working, quarry on k8s. When I run a query it is giving: ` Can't connect to MySQL server on 'enwiki' ([Errno -2] Name or service not known) ` Presumably someth... [19:28:10] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [19:33:31] (EventgateValidationErrors) firing: ... [19:33:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:37:54] 10Data-Engineering, 10ChangeProp, 10Data Engineering and Event Platform Team, 10observability, and 2 others: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics - https://phabricator.wikimedia.org/T350180 (10Ottomata) Or, maybe? I tried today and couldn't get the GC stats I wante... 
[19:47:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:51] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:08] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10SD0001) @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it. 
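Editor's sketch for the Quarry diagnosis just above (20:33): if a key such as `REPLICA_DOMAIN` appears twice at the top level of a YAML file, many loaders silently keep the later value, which would match the suspicion that the empty string is overriding the valid one. The snippet below is a rough, stdlib-only check for repeated top-level keys. It is not a YAML parser, the function name is hypothetical, and the sample values other than `REPLICA_DOMAIN: ''` are invented for illustration.

```python
from collections import defaultdict


def find_duplicate_keys(text):
    """Map each repeated top-level key to the 1-based lines where it appears."""
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, comments, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0] in " \t":
            continue
        key, sep, _ = line.partition(":")
        if sep:
            seen[key.strip()].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


# Hypothetical config: only the empty REPLICA_DOMAIN entry comes from the log.
SAMPLE = """\
REPLICA_DOMAIN: 'replica.example.invalid'
OTHER_SETTING: 'placeholder'
REPLICA_DOMAIN: ''
"""

print(find_duplicate_keys(SAMPLE))  # {'REPLICA_DOMAIN': [1, 3]}
```

A line-based check like this misses duplicates inside nested mappings; it is only meant to flag the kind of top-level repeat described in the ticket comment.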
[20:34:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:50] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:50] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) >>! In T349032#9296495, @SD0001 wrote: > @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it. ooo so it... 
[21:08:48] 10Data-Platform-SRE, 10Discovery-Search, 10Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10bking) [21:15:26] 10Data-Platform-SRE, 10Discovery-Search, 10Epic: Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (10bking) Per pairing session today, the above script needs a small bit of work to fetch the entire document. We'd use it to compare the... [21:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:50] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:24] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Estimate cirrus streaming updater's usage of MWAPI - https://phabricator.wikimedia.org/T350185 (10bking) [21:38:16] (EventgateValidationErrors) resolved: ... 
[21:38:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [22:02:14] (PuppetDisabled) firing: Puppet disabled on dbstore1007:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:39:29] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) Progress report: `wdqs1022`: started reload 2023-10-24 0000 UTC . Munging finished 2023-10-26 0003 UTC. So far, we've processed 409/1104 munge... [23:11:27] 10Data-Engineering, 10Privacy Engineering: Investigate releasing historical top-pageview-per-country data - https://phabricator.wikimedia.org/T299627 (10Htriedman) 05Open→03Resolved a:03Htriedman Update (very late but still necessary): As of Feb 2023, this data request has been completed! Daily data fro... [23:24:56] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Htriedman) 05Open→03Resolved a:03Htriedman