[01:52:57] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:57] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:36] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye
[06:29:44] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye
[06:51:54] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye completed: - an-worker1130 (**PASS**) - Downtimed on Icinga/Alertm...
[06:59:13] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
[07:08:26] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye completed: - an-worker1131 (**PASS**) - Downtimed on Icinga/Alertm...
[07:23:36] (CR) Phuedx: Remove unused schemas (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/755724 (owner: Awight)
[07:40:09] Data-Engineering, Data-Platform-SRE, Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (fgiunchedi)
[07:45:29] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors: - an-worker1132 (**FAIL**) - Downtimed on Ic...
[07:47:52] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (elukey) >>! In T344688#9140427, @BTullis wrote: > I have prepared a patch to update...
[09:25:25] Data-Engineering, EventStreams, Event-Platform: Wik - https://phabricator.wikimedia.org/T345606 (Count_Count)
[09:36:37] Data-Engineering, EventStreams, Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (Count_Count)
[09:52:57] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:03] Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (BTullis) Open→Resolved Tentatively resolving this issue, although I will reopen it if we see the same behaviour again.
[10:35:32] !log Clear airflow false-failed tasks for pageview_hourly (log-aggregation issue)
[10:35:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:46] !log Rerun cassandra_load_pageview_top_articles_monthly
[10:35:46] Schedule: @monthly info Next Run: 2023-09-01, 00:00:00
[10:35:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:00] joal: Was this log aggregation similar to what we saw before? https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn
[10:39:13] (log-aggregation issue)
[10:39:32] btullis: yes, I'm cleaning the leftovers from when we experienced the issue
[10:39:56] Ah, got it, thanks. I was just checking that it wasn't still happening.
[10:40:35] No problem at all btullis :)
[10:43:50] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (BTullis) Open→Resolved There was one incident that resulted from this change: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn In short,...
[10:45:10] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis)
[12:02:03] Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) a:brouberol
[12:05:44] Data-Engineering, Dumps 2.0, Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (WDoranWMF) We should add a short section explaining and linking to the known dumps issues: https://wikitech.wikimedia.org/wiki/User:ArielGlenn/dumps_issues
[12:05:56] Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis)
[12:06:15] Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis) p:Triage→Unbreak!
[12:19:08] Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (dcaro) Probably yes, just had a quick look at the logs and seems a different issue: ` Sep 05 12:17:57 quarry-web-02 uwsgi-quarry-web[1780]: [2023-09-05 12:17:57,240] ERROR in app: Exception on / [GET]...
[12:34:43] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) an-worker1132 seems to be stuck on debian Install as seen below. power cycling the server and retrying the reimage. {F37656999}
[12:35:57] !log power cycle an-worker1132. Host is stuck on debian install after a failed reimage.
[12:35:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:44:59] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
[13:02:25] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) >>! In T340648#9133002, @elukey wrote: > Hi folks! Yes I'd follow what we did for `analytics-product` etc.. since we'll create the same system user (uid/gid) across nod...
[13:24:50] Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Event-Platform: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) Open→Resolved
[13:29:29] (CR) Mforns: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[13:41:25] Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis) p:Unbreak!→High I deleted the deployment from codfw and then re-deployed. This ran the `datahub-main-system-update-job` and `datahub-main-nocode-migration-job`...
[13:46:45] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) a:BTullis
[13:48:13] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:52:37] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[13:52:57] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:07:15] (CR) Btullis: Use sudo with git in refinery_deploy_to_hdfs (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[14:09:37] I am planning to do an eventstreams deploy today, since I have just merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/950194
[14:15:07] !log deploying eventstreams-internal for T344688
[14:15:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:15:10] T344688: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688
[14:23:14] !log deploying eventstreams for T344688
[14:23:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:18] T344688: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688
[14:26:47] !log completed eventstreams and eventstreams-internal deployments.
[14:26:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:34:51] Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (Jclark-ctr)
[14:36:22] Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) ` 2023-08-27 00:34:10 Disk 3 in Backplane 1 of Storage Controller in SL 3 is removed.` I can assure you that it wasn't physically removed. Local time, that was a Sa...
[14:38:01] Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > If you're currently rolling your own yarn assembly file, can't you simply upload it to the cluster anyway, using your sudo -u hdfs privileges?...
[14:38:44] Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) a:Jhancock.wm
[14:41:18] Data-Engineering, Data-Platform-SRE, Product-Analytics, Wmfdata-Python, Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis)
[14:59:45] Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Vgutierrez) I don't think so as it's still using role `insetup::search_platform` but @bking and @RKemper should have more context about it
[15:06:48] Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > Given this conversation, I agree this should be a separate task. Will open it and CC you. I should not comment before coffee. I had already op...
[15:07:31] Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo)
[15:09:40] Data-Platform-SRE, SDC General, Wikidata, Wikidata-Query-Service: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 (bking) Hello, I've fixed wdqs1003 as well and it appears we have [[ https://grafana.wikimedia.org/d/000000489/wikidata-...
[15:24:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:43] (SystemdUnitFailed) resolved: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:12] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (OSefu-WMF) @BTullis - Looks like that fixed it! Thanks.
[15:37:44] Data-Platform-SRE, SDC General, Wikidata, Wikidata-Query-Service: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 (Gehel) Open→Resolved a:Gehel
[15:43:59] Data-Platform-SRE: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413 (Gehel) Open→Resolved Server has been behaving correctly since last reboot, let's close.
[15:57:11] Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (BTullis) Open→Stalled p:High→Medium There is ongoing discussion about the possible deprecation of Hue happening [[https://wikimedia.slack.com/archives/CLKDS4MG9/p169272831...
[16:20:31] Data-Engineering, All-and-every-Wikisource, ArticlePlaceholder, BetaFeatures, and 55 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[16:30:35] Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - https://phabricator.wikimedia.org/T343823 (mpopov) I have a radical idea to not use conda-analytics for R //at all//. What's the most important thing about conda-analytics? The fact that environments can b...
[16:39:02] (PS1) Btullis: Increase the max kafka message size for gobblin [analytics/refinery] - https://gerrit.wikimedia.org/r/954968 (https://phabricator.wikimedia.org/T307959)
[18:02:53] (CR) Milimetric: [C: +1] "LGTM, just a thought on testing" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[18:03:50] Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) As Wikidata's Analytics Product Manager I am not focused on the technical engineering aspects. But let me still try to provide some context that might be helpful: * The gene...
[18:13:47] (Abandoned) Clare Ming: Add Metrics Platform fragments by entity, platform [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:17:14] (PS9) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[18:19:44] (CR) Clare Ming: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:55:58] (CR) Peter Fischer: "This change is ready for review." [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[18:57:18] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) >>! In T344688#9141474, @elukey wrote: >> If I understand it correctly, t...
[19:02:45] (PS3) Peter Fischer: Reuse existing schema fragments for redirects. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[19:09:24] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) >>! In T344688#9128543, @gmodena wrote: > @BTullis @elukey re the MW cont...
[19:52:27] Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (AKanji-WMF)
[20:45:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:52] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:26] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:52] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bo...
[21:23:38] Data-Platform-SRE, Discovery-Search: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (bking)
[21:24:05] Data-Platform-SRE, Discovery-Search: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (bking)
[22:22:39] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookwo...
[23:13:34] Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (colewhite)
[23:44:39] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk2001.codfw.wmnet` - flink-zk200...