[00:19:02] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:44] (SystemdUnitFailed) firing: (2) hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:20] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:44] (SystemdUnitFailed) firing: (2) hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[02:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[02:54:31] Data-Platform-SRE: Decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342035 (RKemper)
[05:33:05] Data-Platform-SRE: Decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342035 (RKemper)
[05:49:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:43] Data-Platform-SRE: Decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342035 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `wdqs[2004-2006].codfw.wmnet` - wdqs2004.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager - Found physic...
[06:11:19] Data-Platform-SRE: Decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342035 (RKemper) Decom cookbook finished, and dc-ops ticket created (see ticket desc AC section for ticket #)
[06:26:43] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (phuedx)
[06:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[06:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[07:19:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:15] o/ stevemunene, btullis would one of you have some time to do a hard reboot on wdqs1003.eqiad.wmnet? it's completely stuck.
[08:50:59] o/ having a look dcausse
[08:51:15] thanks! :)
[09:00:03] stevemunene: oops this is wdqs1013.eqiad.wmnet not wdqs1003 (sorry about that!)
[09:03:30] Data-Platform-SRE, DBA, cloud-services-team: Migrate wiki replicas (clouddb*) hosts to MariaDB 10.6 - https://phabricator.wikimedia.org/T334651 (aborrero) >>! In T334651#9038046, @fnegri wrote: >> #cloud-services-team any objections from your side with this migration? > > I don't think we have any o...
[09:06:23] aah thanks, it seems to be under unusually high cpu load
[09:40:52] yes not sure what's happening there, it's not the first time we see such behavior there, I'd be tempted to blame an IO being stuck freezing the system completely
[09:42:56] !log powercycle wdqs1013.eqiad.wmnet
[09:42:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:44:10] btullis, stevemunene o/ - I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/941362 that is a no-op for kafka-jumbo
[09:44:24] but the settings that we have are a little weird
[09:44:33] <_joe_> milimetric: so vgutierrez and I took a look at T342577 and I think the "problem" is that if one of the generic ratelimits we have in vcl gets hit before we get to requestctl, no requestctl piece is set in X-Analytics
[09:44:34] T342577: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577
[09:44:51] <_joe_> so either we move all of those rules to requestctl
[09:45:50] Data-Engineering, EventStreams, Event-Platform: Make eventgate-analytics-external the default event service - https://phabricator.wikimedia.org/T342610 (phuedx)
[09:48:17] <_joe_> or we add a requestctl header to those ratelimits
[09:56:11] stevemunene: the system seems back online and now catching up, thanks!
[10:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[10:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:25:12] Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) Tentatively moving this task to Done. Puppet now runs cleanly and Icinga is clean for these servers. There are still likely to be some changes to make regarding...
[11:34:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:55] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) We now have raw storage available in this ceph cluster: ` btullis@cephosd1001:~$ sudo ceph df --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 1010 TiB...
[12:19:35] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (lbowmaker)
[12:21:47] Data-Engineering, Data Engineering and Event Platform Team, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (lbowmaker)
[12:22:13] Data-Engineering, Data Engineering and Event Platform Team, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (lbowmaker)
[12:22:26] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0), Event-Platform (Sprint 14 B): jsonschema-tools test should fail if fields are removed in new (non major) version - https://phabricator.wikimedia.org/T340765 (lbowmaker) Open→Resolved
[12:23:48] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: mw-page-content-change-enrich: alert on SLIs degradation only on active DC - https://phabricator.wikimedia.org/T342258 (lbowmaker)
[12:29:14] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform: mediawiki page_content_change should generate new meta.id field - https://phabricator.wikimedia.org/T341277 (lbowmaker)
[12:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:33] (CR) Mazevedo: [C: +2] Update schemas for iOS diff view changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/941012 (https://phabricator.wikimedia.org/T341896) (owner: Tsevener)
[12:41:17] (Merged) jenkins-bot: Update schemas for iOS diff view changes [schemas/event/secondary] - https://gerrit.wikimedia.org/r/941012 (https://phabricator.wikimedia.org/T341896) (owner: Tsevener)
[12:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:44] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:49] _joe_: makes sense to me, can I help with the implications of that decision?
[13:20:56] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) Ah, it occurs to me that radosgw also makes use of [[https://docs.ceph.com/en/quincy/radosgw/pools|multiple pools]] anyway, so my previous comment about requiring a single pool for rad...
[13:38:44] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[13:38:48] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Gehel)
[13:38:57] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (Gehel)
[13:38:59] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[13:55:45] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (bking)
[14:14:29] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) I think I'll start by going with erasure coding for RBD on both device classes, using the values `k=3` and `m=2` This gives a 60% efficiency in storage usage, which is greater than eit...
[14:20:22] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (MatthewVernon) FWIW, I've tended to view disk as cheap and complexity as expensive to have used replicated (n=3) in the past. [are you intending the RGW service to be general-purpose?]
[14:21:35] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (MatthewVernon) ...but I'd be inclined to always put bucket indexes and suchlike on fast storage even if the objects themselves are on spinning disks.
[14:41:34] Data-Platform-SRE, DC-Ops, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (RobH)
[14:41:39] Data-Platform-SRE, DC-Ops, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (RobH)
[14:45:55] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (xcollazo) This is definitely me: ` xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_iceberg.db/* Picked up JAVA_TOOL_OPTIONS: -Dfile...
[14:49:51] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (RobH)
[14:49:57] Data-Platform-SRE, SRE, ops-eqiad: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (RobH)
[14:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[14:53:34] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:10:09] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (xcollazo) Ran a couple of `hdfs dfs -rm -r -skipTrash`. Things look better now: ` xcollazo@stat1007:~$ hdfs dfs -count -v /user/hive/warehouse/xcollazo_i...
[15:28:20] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (xcollazo) > I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out where the large numbers of files are located...
[15:34:21] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (BTullis) a:BTullis
[15:47:31] Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (Gehel)
[15:50:18] Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (Gehel) p:Triage→Medium
[15:57:38] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (BTullis) >>! In T342587#9041730, @xcollazo wrote: >> I know that recently @JAllemandou and @Antoine_Quhen have done some work to allow us to find out whe...
[15:59:43] Data-Platform-SRE: Alert: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://phabricator.wikimedia.org/T342587 (BTullis) Open→Resolved >>! In T342587#9041680, @xcollazo wrote: > Ran a couple of `hdfs dfs -rm -r -skipTrash`. Things look better now: > ` > xco...
[16:49:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:33] Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) >>! In T326945#9041443, @MatthewVernon wrote: > > [are you intending the RGW service to be general-purpose?] > Thanks Matthew, I appreciate your viewpoint. > FWIW, I've tended to vi...
[17:46:47] Analytics-Radar, Data-Engineering-Icebox, Discovery-Search, Reading-Admin, and 4 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (KHernandez-WMF)
[17:49:12] Analytics-Radar, Data-Engineering-Icebox, Discovery-Search, Reading-Admin, and 4 others: Image Classification Research and Development - https://phabricator.wikimedia.org/T215413 (Miriam)
[18:53:28] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[18:53:28] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:28:50] Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Event-Platform, and 2 others: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (bking) Unfortunately, the package build is failing. We're following the process from [[...
[19:34:42] Data-Engineering, Data Engineering and Event Platform Team, Discovery-Search (Current work), Event-Platform, and 2 others: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (taavi) You seem to have added an extra space at the end of `BUILD_VERSION` in https://g...
[20:49:44] (SystemdUnitFailed) firing: jupyter-dsaez-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:02:27] Data-Platform-SRE, Discovery-Search (Current work): Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking)
[21:18:08] Data-Platform-SRE, Discovery-Search (Current work): Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking) [[ https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service#I've_got_a_new_WDQS_host,_how_do_I_get_it_ready_for_production? | Documentation has been updated, ]] bu...
[21:22:20] Data-Platform-SRE: Ensure WCQS stack works on Bullseye or later - https://phabricator.wikimedia.org/T342701 (bking)
[22:53:43] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[22:53:43] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[23:45:14] Data-Engineering, Data-Platform-SRE, Movement-Insights, Product-Analytics, Epic: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (Mayakp.wiki) @BTullis Currently I am knee deep in OKR SDS1.1 work and providing baselines for other OKRs but...