[00:34:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:27] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[02:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[06:04:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[06:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:47:44] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 60 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (karapayneWMDE)
[09:29:44] (SystemdUnitFailed) firing: (2) jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:59] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (phuedx)
[09:31:03] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning, Patch-Needs-Improvement: Deprecate/delete the mw.eventLog.Schema class - https://phabricator.wikimedia.org/T305491 (phuedx)
[09:34:49] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (phuedx)
[10:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[10:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:56:30] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) >>! In T340648#9034367, @Manuel wrote: > Could you please add @karapayneWMDE to the parent group? If not, what would be required to do so? > (see {T284308} for referen...
[12:12:32] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) The first of the cephosd hosts now has all OSDs active. There are 20 OSDs, numbered 0 to 19. ` btullis@cephosd1001:~$ sudo ceph osd tree ID CLASS WEIGHT T...
[12:19:48] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (Marostegui) @BTullis I was planning to depool the other via the normal haproxy puppet change. But I am happy to try other approaches if you want me to
[13:29:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:58] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (karapayneWMDE) Request made: https://phabricator.wikimedia.org/T342546
[13:39:59] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (BTullis) >>! In T334651#9037480, @Marostegui wrote: > @BTullis I was planning to depool the other via the normal haproxy puppet change. But I am happy to tr...
[13:42:13] Data-Platform-SRE, Release Pipeline, ci-test-error: Limitations on CI fetching files from the wikimedia public datasets archive - https://phabricator.wikimedia.org/T341582 (BTullis) Adding #data-platform-sre because I think that this might be something to do with us and an-web1001, from where these f...
[13:43:46] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) All 100 OSD daemons are installed and running. `lines=10 root@cephosd1002:~# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-A...
[13:48:29] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) Noting down some things to fix, while I think about it: * Need to install manually: `ceph-osd`, `ceph-volume`, `hdparm` * Need to take a copy of `/var/lib/ceph/...
[14:30:53] (CR) Phuedx: [C: +1] Add mediawiki/cirrussearch/page-rerender [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[14:35:43] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (fnegri) > #cloud-services-team any objections from your side with this migration? I don't think we have any objections, cc @aborrero
[14:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[14:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:15:00] Data-Platform-SRE, SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (Gehel) a:bking
[15:20:39] Data-Platform-SRE, Discovery-Search (Current work): Reimage WDQS servers to Bullseye - https://phabricator.wikimedia.org/T328325 (bking) @MoritzMuehlenhoff Sorry for the delayed response. Some of these will be decommissioned per hardware refresh, see [[ https://docs.google.com/spreadsheets/d/1y3kh8JAYlb3...
[15:20:52] Data-Platform-SRE, Discovery-Search (Current work): Reimage WDQS servers to Bullseye - https://phabricator.wikimedia.org/T328325 (Gehel) a:bking
[15:22:12] Data-Platform-SRE, Discovery-Search (Current work): Reimage wdqs20[13-22] servers to Bullseye - https://phabricator.wikimedia.org/T328325 (bking)
[15:23:14] Data-Platform-SRE, Discovery-Search (Current work): Reimage wdqs20[13-22] servers to Bullseye - https://phabricator.wikimedia.org/T328325 (bking) Updated ticket message to make the AC more clear...now moving to "Needs Reporting"
[15:26:43] Data-Platform-SRE, Discovery-Search (Current work): Ensure WCQS/WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (Gehel) Manual steps (see above) need to be documented on wiki before we close this task.
[15:27:20] Data-Platform-SRE, Discovery-Search: Test flink operations/failure scenarios relevant to Search Update Pipeline - https://phabricator.wikimedia.org/T342010 (bking) a:bking
[15:29:42] Data-Engineering, Data-Platform-SRE, Discovery-Search, Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[15:29:47] Data-Platform-SRE, Discovery-Search: Test flink operations/failure scenarios relevant to Search Update Pipeline - https://phabricator.wikimedia.org/T342010 (bking) Open→Invalid
[15:30:24] Data-Engineering, Data-Platform-SRE, Discovery-Search, Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel) p:Triage→High
[15:30:32] Data-Platform-SRE, Discovery-Search: Test flink operations/failure scenarios relevant to Search Update Pipeline - https://phabricator.wikimedia.org/T342010 (bking) p:Triage→Low
[15:30:44] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[15:31:16] Data-Platform-SRE, Release-Engineering-Team, Scap: "scap deploy"'s config-deploy should check for broken symlinks - https://phabricator.wikimedia.org/T342162 (Gehel)
[15:32:01] Data-Platform-SRE: Examine/refactor WDQS categories update scripts - https://phabricator.wikimedia.org/T342361 (Gehel)
[15:33:42] Data-Engineering, Discovery-Search, Wikidata, Wikidata-Query-Service: Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (Gehel)
[15:35:04] Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) - https://phabricator.wikimedia.org/T342463 (Gehel)
[15:37:08] Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (bking)
[15:43:50] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[17:22:18] Data-Engineering, Movement-Insights, Product-Analytics, Wmfdata-Python: Enable wmfdata-py to access MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340467 (mpopov)
[17:24:21] Data-Engineering, Product-Analytics, Wmfdata-Python: Enable wmfdata-py to access MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340467 (mpopov) @nshahquinn-wmf: Did you want to keep tabs on this on the Movement Insights board?
[17:29:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:44:00] Quarry: Autocomplete for the database field shows invalid databases - https://phabricator.wikimedia.org/T342569 (Novem_Linguae)
[18:46:31] Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (Novem_Linguae) I've had this bug for a couple weeks. Usually when opening quarry for the first time during that browsing session. A refresh fixes it. If I recall correctly, I've never had it happen twi...
[18:52:13] (CR) Milimetric: [C: -1] "-1 only because of the name (guideline is to use _ instead of -), everything else follows guidelines." [schemas/event/primary] - https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: DCausse)
[18:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[18:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:08:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:14:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:19:23] Quarry: Autocomplete for the database field shows invalid databases - https://phabricator.wikimedia.org/T342569 (Novem_Linguae)
[19:28:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:29:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:54:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[19:58:38] xcollazo: Just a guess, but might these HDFS RPC call queue length alarms be something to do with you? You've got a pretty hefty job running here: https://yarn.wikimedia.org/cluster/app/application_1688722260742_86888
[20:02:40] I'm not too concerned by it, because the graph shows that it's not hugely overwhelmed, just that there's much more activity than usual.
[20:19:33] Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (bking)
[20:19:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[20:20:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[20:25:06] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength
[21:25:46] Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (rook)
[21:25:55] Quarry: Autocomplete for the database field shows invalid databases - https://phabricator.wikimedia.org/T342569 (rook)
[21:29:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:14] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) I've made two small CRs to suggest fixes for the errors mentioned in T330151#9037871 but there is one more fix that will require a little more thinking about. T...
[21:47:23] (PS1) Tsevener: Update schemas for iOS diff view changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/941012 (https://phabricator.wikimedia.org/T341896)
[22:01:48] Data-Platform-SRE, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (BTullis)
[22:01:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[22:09:14] Data-Engineering, Product-Analytics, User-Iflorez: Use Hive/Spark timestamps in Refined event data - https://phabricator.wikimedia.org/T278467 (Iflorez)
[22:11:10] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) Moving this task to in-progress, so that I can use it to record the pool creation and the related crush rules.
[22:15:18] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (BTullis)
[22:31:59] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (BTullis) p:Triage→Medium This is not super-urgent to fix. It might be related to some work on the Iceberg migration by @xcollazo. I know that rec...
[22:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[22:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[23:30:49] Data-Platform-SRE: Decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342035 (RKemper)