[01:16:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:58] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:27:39] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (REsquito-WMF) @Ottomata That should work for us, we have https://ph... [01:58:25] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Chlod) @Ottomata Also good here. Canary event filtering was released... 
[02:47:58] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [03:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:58] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:50:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:57:03] stevemunene: I’ll work today on making sure puppet is idempotent on an-test-client1002. My work on puppet using the skein certificate hit a dead end [07:57:19] * brouberol waves good morning [08:00:16] (EventgateValidationErrors) firing: ... 
[08:00:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:43:11] (EventgateValidationErrors) resolved: ... [08:43:11] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:44:30] (EventgateValidationErrors) firing: ... [08:44:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:49:30] (EventgateValidationErrors) resolved: ... 
[08:49:31] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:50:33] FYI this CR should remove the need to disable puppet on an-test-client1002 (which was done to avoid getting "Puppet performing a change on every puppet run " alerts): https://gerrit.wikimedia.org/r/c/operations/puppet/+/971196 [08:55:45] (EventgateValidationErrors) firing: ... [08:55:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:05:45] (EventgateValidationErrors) firing: ... [09:05:46] (2) eventgate-analytics-external stream eventlogging_UniversalLanguageSelector validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationEr [09:10:45] (EventgateValidationErrors) firing: ... 
[09:10:46] (2) eventgate-analytics-external stream eventlogging_UniversalLanguageSelector validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationEr [09:15:21] also, welcome back btulis! [09:15:30] * btullis [09:23:50] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (dcausse) @bkink thanks for triggering the import, could update the task description with the dump files you used? (needed because we have to explicit... [09:27:10] Morning all. [09:33:01] I'm catching up a week's worth of backscroll and emails etc. Feel free to let me know if there's anything you'd like me to look at this morning. 
[09:53:28] Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Data-Platform: Figure out a way to automatize deployment of the spark assembly file - https://phabricator.wikimedia.org/T336513 (BTullis) [09:58:30] I'd like to work out a good time this week to deploy this patch, if possible: Deploy multiple spark shufflers for yarn to production | https://gerrit.wikimedia.org/r/c/operations/puppet/+/964008 [10:53:11] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:43:17] (CR) Milimetric: [C: +2] build: Remove 'wmui-base' dependency, has never been used anyways [analytics/wikistats2] - https://gerrit.wikimedia.org/r/971605 (https://phabricator.wikimedia.org/T334934) (owner: VolkerE) [11:44:45] (Merged) jenkins-bot: build: Remove 'wmui-base' dependency, has never been used anyways [analytics/wikistats2] - https://gerrit.wikimedia.org/r/971605 (https://phabricator.wikimedia.org/T334934) (owner: VolkerE) [12:00:28] Data-Platform-SRE: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) I am running the following to create a binary backup of the mariadb instance on an-coord1002. ` btullis@cumin1001:~$ sudo transfer.py --type=xtrabackup an-coord1002.eqiad.wmnet:/run/mysqld/mysqld.soc... [12:33:05] Data-Platform-SRE: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) The output of that command was as follows: ` 2023-11-06 11:57:30 INFO: About to transfer /run/mysqld/mysqld.sock from an-coord1002.eqiad.wmnet to ['an-mariadb1001.eqiad.wmnet']:['/srv/sqldata'] (412... 
[12:44:46] Data-Platform-SRE: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) Cleaned up and re-enabled puppet on an-mariadb1001. Icinga is green. Repeating the above steps with an-mariadb1002. [12:49:33] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) [13:10:16] Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (brouberol) The simpler avenue of using systemd timers seems to work nicely: ` brouberol@an-test-client1002:~$ sudo openssl x509 -in /srv/airflow-analytics_test/.skein/skein.crt -text | grep After... [13:11:01] (EventgateValidationErrors) firing: ... [13:11:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [13:14:08] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (SD0001) The gunicorn migration sounds like an unlikely culprit, since it's the db connections referenced here - which are managed by pymysql in any case. [13:16:36] btullis: if you have 2 minutes, this https://gerrit.wikimedia.org/r/c/operations/puppet/+/971947 would enable automatic skein certificate on all launchers, now that I've tested that the service works. 
[13:18:33] also, it seems that the `Hosts` stanza of https://gerrit.wikimedia.org/r/c/operations/puppet/+/971942/ causes the pcc job to fail [13:19:42] !log disable puppet on druid1004 and druid10[09-11] to Onboard new druid1009 to the ZooKeeper cluster for `druid-public-eqiad` cluster [13:19:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:20:14] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (jcrespo) > the given socket does not have a known format I think it is because it doesn't know how to transform that into a datadir, as it assumes all section names are documented on... [13:25:43] !log stop and disable zookeeper on druid1004 T336042 [13:25:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:25:46] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [13:30:02] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (jcrespo) Ah, I see the issue: https://github.com/wikimedia/operations-software-transferpy/blob/cd9027a9beee2cf2ae51b2b6f1be216637775bf9/transferpy/Transferer.py#L244 The port file is n... [13:32:36] !log restart zookeeper leader to pick up new host druid1009 T336042 [13:32:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:32:39] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [13:32:45] Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (brouberol) In progress→Resolved I've enabled monthly renewal of skein certificates (and we now also have alerting based on new prometheus metrics reflecting the certificate expiration date, as a sec... 
[13:33:40] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [13:43:37] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (Stevemunene) >>! In T336042#9284972, @Stevemunene wrote: > Zookeeper stopped on `druid1005`, `druid1011` is now the new leader. > > ` > stevemunene@druid1011:~$ echo mntr | nc localhost 218... [13:44:13] Data-Platform-SRE, Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (Stevemunene) [13:53:28] Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (brouberol) @pfischer I think there might be 2 issues at play here. The first one might indeed be a permission issue. As an admin in datahub, I only see the following: {F41457822} No config except... [13:57:36] !log roll-restart druid public workers to pick up a new zookeeper node druid1009. T336042 [13:57:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:57:48] T336042: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 [13:58:11] (SystemdUnitFailed) firing: (2) druid-coordinator.service Failed on druid1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:01] btullis: am I admin in datahub or do I have a more regular role? 
I'm investigating potential ACL issues related to https://phabricator.wikimedia.org/T344989 [13:59:26] Checking now. [14:00:00] PROBLEM - aqs endpoints health on aqs1016 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:06] PROBLEM - aqs endpoints health on aqs2002 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:32] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:46] PROBLEM - aqs endpoints health on aqs2012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:00:52] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the 
unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:02] brouberol: I adminified you. You should be able to check by visiting here: https://datahub.wikimedia.org/group/urn:li:corpGroup:76fbf709-8faa-47e0-b31e-dee18a1b403d/members [14:01:12] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:20] thank you! [14:01:28] PROBLEM - aqs endpoints health on aqs1018 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:30] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:01:56] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:08] Hmm. I'm a bit concerned that these AQS healthcheck endpoints checks are real errors. 
[14:02:10] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:14] PROBLEM - aqs endpoints health on aqs2010 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:02:16] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:03] stevemunene: These are likely related to druid health. What's the latest, as far as you are concerned? 
[14:03:11] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:14] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:14] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:30] RECOVERY - aqs endpoints health on aqs1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:32] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:32] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:51] RECOVERY - aqs endpoints health on aqs2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:06] RECOVERY - aqs endpoints health on aqs1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:06] Oh good, they're coming back. 
[14:04:08] btullis: currently doing a roll restart of the druid public cluster, but nothing screaming so far [14:04:12] RECOVERY - aqs endpoints health on aqs2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:14] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:18] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:18] RECOVERY - aqs endpoints health on aqs2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:19] RECOVERY - aqs endpoints health on aqs2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:00] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:00] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:05:21] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:32] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:40] PROBLEM - aqs endpoints health on aqs2004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:50] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:54] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:54] PROBLEM - aqs endpoints health on aqs2009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:11:56] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:08] PROBLEM - aqs endpoints health on aqs1018 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:10] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:18] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:18] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:19] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:36] PROBLEM - aqs endpoints health on aqs2012 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:44] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:22] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:28] PROBLEM - aqs endpoints health on aqs2003 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:46] RECOVERY - aqs endpoints health on aqs2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:08] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:14] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:14] RECOVERY - aqs endpoints health on aqs2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:15] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:28] 
RECOVERY - aqs endpoints health on aqs1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:30] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:32] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:14:42] Data-Engineering, Data Engineering and Event Platform Team, serviceops, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) Ah, got it. It is an envoy setting. https://www.envoyproxy.io/docs/envoy/lates... [14:15:00] Analytics, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (Ottomata) [14:15:01] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:15:01] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:15:09] RECOVERY - aqs endpoints health on aqs2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:16:21] PROBLEM - aqs endpoints health on aqs1020 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:23] RECOVERY - aqs endpoints health on aqs1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs 
[14:18:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:25] PROBLEM - aqs endpoints health on aqs2010 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:19:27] RECOVERY - aqs endpoints health on aqs2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:23:19] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:24:18] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) >>! In T349032#9308383, @SD0001 wrote: > The gunicorn migration sounds like an unlikely culprit, since it's the db connections referenced here - which are managed by pymysql in any case. Did I install the db correctl... 
[14:25:31] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:38:20] RECOVERY - aqs endpoints health on aqs2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:42:31] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:53:11] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:53:31] 10Data-Platform-SRE: Restore datahubadmin group to the admin datahub policy - https://phabricator.wikimedia.org/T350589 (10brouberol) [14:57:13] (03CR) 10Milimetric: [C: 03+2] Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [14:57:21] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update schema of mediawiki_wikitext_* [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [14:57:36] (03CR) 10Milimetric: [C: 03+2] Add siteinfo information to output XML [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761) (owner: 10Milimetric) [15:05:13] (DiskSpace) firing: Disk space druid1004:9100:/srv 3.259% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=druid1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:08:44] (03Merged) 10jenkins-bot: Add siteinfo information to output XML [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 
(https://phabricator.wikimedia.org/T348761) (owner: 10Milimetric) [15:10:48] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) [15:12:35] 10Data-Platform-SRE: Restore datahubadmin group to the admin datahub policy - https://phabricator.wikimedia.org/T350589 (10brouberol) I have found the following policy definition mapping the policy to an empty list of groups: ` MariaDB [datahub]> SELECT * FROM metadata_aspect_v2 WHERE aspect = 'dataHubPolicyInfo... [15:16:48] PROBLEM - Disk space on druid1004 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1004&var-datasource=eqiad+prometheus/ops [15:21:16] (03PS5) 10Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) [15:21:18] (03CR) 10Milimetric: [V: 03+2] Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [15:22:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10Ottomata) > message_key_field defines a record <-> event mapping Oh!... 
[15:22:56] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10brouberol) a:03brouberol [15:24:05] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10brouberol) @gmodena @pfischer Can you tell me if you belong to a given datahub group? I should be able to see this on my own, but I'd like a confirmation, as I'm not yet very knowledgeable about dat... [15:24:40] (03PS1) 10Milimetric: Update project namespace map view [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971978 (https://phabricator.wikimedia.org/T350489) [15:27:24] (03CR) 10Mforns: [C: 03+2] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971978 (https://phabricator.wikimedia.org/T350489) (owner: 10Milimetric) [15:27:26] (03PS2) 10Milimetric: Update project namespace map view [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971978 (https://phabricator.wikimedia.org/T350489) [15:30:13] (DiskSpace) resolved: Disk space druid1004:9100:/srv 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=druid1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:31:48] (03CR) 10Ottomata: Adds new readme (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/968714 (https://phabricator.wikimedia.org/T349729) (owner: 10Kimberly Sarabia) [15:37:07] RECOVERY - Disk space on druid1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1004&var-datasource=eqiad+prometheus/ops [15:44:13] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10Ottomata) kafkacat doesn't work? ` 15:41:19 [@stat1004:/home/otto] $ kafkacat -L -b kafka-jumbo1007.eqiad.wmnet:9092 ... 
topic "mediawiki_CirrusSearchRequestSet" with 12 partitions: partition... [15:48:16] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10brouberol) @BTullis I see that both Peter and Gabriele have no associated role (either reader, editor or admin), and more surprisingly no group, even `wmf`). Meaning that I think they are covered by... [15:54:07] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar, 10Epic, 10Kubernetes: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10Gehel) p:05Triage→03Medium [15:55:16] 10Data-Platform-SRE, 10Discovery-Search: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10Gehel) p:05Triage→03High [15:55:45] 10Data-Platform-SRE, 10Discovery-Search (Current work): Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10Gehel) [16:12:11] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) [16:21:35] (03Abandoned) 10Milimetric: [WIP] working on understanding and testing page history and quality [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468678 (owner: 10Milimetric) [16:24:04] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) [16:25:45] (03PS1) 10Milimetric: Update changelog for v0.2.24 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971985 [16:25:56] (03CR) 10Milimetric: [C: 03+2] Update changelog for v0.2.24 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971985 (owner: 10Milimetric) [16:28:31] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (10Marostegui) All done, ready for the views creation. 
[16:28:40] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (10Marostegui) All done, ready for the views creation. [16:28:48] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (10Marostegui) All done, ready for the views creation. [16:28:56] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (10Marostegui) All done, ready for the views creation. [16:33:34] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (10BTullis) The views were generated, but the cookbook failed when attempting to run the DNS step. ` ----- OUTPUT of 'source /root/nov...ca-dns --aliases' -----... [16:33:46] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (10BTullis) a:03BTullis [16:37:32] 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (10Marostegui) Confirmed that I can query the view just fine and see all the rows. So it might be indeed just related to the DNS and not affecting the data underneath. Once it... [16:38:32] (03Merged) 10jenkins-bot: Update changelog for v0.2.24 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/971985 (owner: 10Milimetric) [16:39:50] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) >>! In T284150#9308406, @jcrespo wrote: > I will change that to use the section_ports file instead, where analytics_meta is a known section. Thanks @jcrespo - I think that would... 
[16:41:35] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) [16:42:43] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10bking) [16:43:43] 10Data-Platform-SRE, 10serviceops-radar, 10Discovery-Search (Current work), 10Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (10Gehel) [16:44:41] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on an-test-client1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:05:46] (EventgateValidationErrors) resolved: ... [17:05:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:07:16] (EventgateValidationErrors) firing: ... [17:07:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:12:15] (EventgateValidationErrors) resolved: ... 
[17:12:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:16:28] Starting build #129 for job analytics-refinery-maven-release-docker [17:18:45] (EventgateValidationErrors) firing: ... [17:18:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:23:06] 10Data-Engineering, 10Documentation: Document destination_event_service Event Platform stream configuration - https://phabricator.wikimedia.org/T313859 (10TBurmeister) [17:30:23] Project analytics-refinery-maven-release-docker build #129: 09SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/129/ [17:33:55] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [17:34:50] Starting build #88 for job analytics-refinery-update-jars-docker [17:35:14] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.24 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/971433 [17:35:15] Project analytics-refinery-update-jars-docker build #88: 09SUCCESS in 24 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/88/ [17:37:36] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add dga.wikipedia to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/969998 (https://phabricator.wikimedia.org/T350229) (owner: 10Gerrit maintenance bot) [17:38:30] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add bjn.wikiquote to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970000 (https://phabricator.wikimedia.org/T350235) (owner: 10Gerrit maintenance bot) [17:38:39] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add zgh.wikipedia to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970004 (https://phabricator.wikimedia.org/T350241) (owner: 10Gerrit maintenance bot) [17:39:00] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add bbc.wikipedia to pageview allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970838 (https://phabricator.wikimedia.org/T350373) (owner: 10Gerrit maintenance bot) [17:39:57] (03CR) 10Milimetric: [C: 03+2] "ask me about this if you want the context... we've been thinking of turning off this allow list feature." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 (owner: 10Clare Ming) [17:39:59] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add slo.wikimedia to whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/970404 (owner: 10Clare Ming) [17:43:23] (03PS2) 10Milimetric: Add refinery-source jars for v0.2.24 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/971433 (owner: 10Maven-release-user) [17:43:39] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add refinery-source jars for v0.2.24 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/971433 (owner: 10Maven-release-user) [18:26:33] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10gmodena) Thanks for looking into this @brouberol. > @pfischer @gmodena could you tell me if you can see anything under Properties in https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dat... [18:38:22] !log deployed refinery-source, starting to deploy analytics airflow dags [18:38:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:53] 10Data-Platform-SRE, 10Epic: [Epic] define a strategy around alerting for Data Platform SRE and implement it - https://phabricator.wikimedia.org/T345698 (10Gehel) [19:26:55] 10Data-Platform-SRE, 10Discovery-Search (Current work): Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10Gehel) [19:27:15] 10Data-Platform-SRE: Search Platform Airflow jobs: Identify dependencies and configure alerts - https://phabricator.wikimedia.org/T350499 (10Gehel) [20:18:11] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:45] (EventgateValidationErrors) resolved: ... [21:08:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:16:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:13] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:55] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia 
occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Alright, just deployed... [21:33:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:33:56] (DruidSegmentsUnavailable) firing: (4) More than 10 segments have been unavailable for mediawiki_history_reduced_2023_06 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [21:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:04] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) Okay, I just applied the prestop_sleep settings to all eventgates.... [21:51:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:11] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Let's keep an eye on on... 
[21:54:30] 10Data-Engineering, 10serviceops, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (10Ottomata) a:03Ottomata [22:00:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:19:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:55] something's up with the cluster, a few of us are running basic queries like counting simple dataframes and getting long garbage collection times and lots of failed executors on timeouts with "container on bad node" messages. 
[22:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:06:07] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10bking) Current 
status: flink-operator is listening for rdf-streaming-updater rdf-stream... [23:06:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:34:41] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:11] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed