[01:52:57] <jinxer-wm>	 (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:57] <jinxer-wm>	 (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:36] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye
[06:29:44] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye
[06:51:54] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye completed: - an-worker1130 (**PASS**)   - Downtimed on Icinga/Alertm...
[06:59:13] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
[07:08:26] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye completed: - an-worker1131 (**PASS**)   - Downtimed on Icinga/Alertm...
[07:23:36] <wikibugs>	 (CR) Phuedx: Remove unused schemas (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/755724 (owner: Awight)
[07:40:09] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (fgiunchedi)
[07:45:29] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors: - an-worker1132 (**FAIL**)   - Downtimed on Ic...
[07:47:52] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (elukey) >>! In T344688#9140427, @BTullis wrote: > I have prepared a patch to update...
[09:25:25] <wikibugs>	 Data-Engineering, EventStreams, Event-Platform: Wik - https://phabricator.wikimedia.org/T345606 (Count_Count)
[09:36:37] <wikibugs>	 Data-Engineering, EventStreams, Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (Count_Count)
[09:52:57] <jinxer-wm>	 (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:03] <wikibugs>	 Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (BTullis) Open→Resolved Tentatively resolving this issue, although I will reopen it if we see the same behaviour again.
[10:35:32] <joal>	 !log Clear airflow false-failed tasks for pageview_hourly (log-aggregation issue)
[10:35:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:46] <joal>	 !log Rerun cassandra_load_pageview_top_articles_monthly
[10:35:46] <joal>	 Schedule: @monthly info Next Run: 2023-09-01, 00:00:00
[10:35:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:00] <btullis>	 joal: Was this log aggregation similar to what we saw before? https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn
[10:39:13] <btullis>	 (log-aggregation issue)
[10:39:32] <joal>	 btullis: yes, I'm cleaning the leftovers from when we experienced the issue
[10:39:56] <btullis>	 Ah, got it, thanks. I was just checking that it wasn't still happening.
[10:40:35] <joal>	 No problem at all btullis :)
[10:43:50] <wikibugs>	 Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (BTullis) Open→Resolved There was one incident that resulted from this change: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn In short,...
[10:45:10] <wikibugs>	 Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis)
[12:02:03] <wikibugs>	 Data-Platform-SRE: Write a cookbook for rolling reboot/restart of datahubsearch servers - https://phabricator.wikimedia.org/T344798 (brouberol) a:brouberol
[12:05:44] <wikibugs>	 Data-Engineering, Dumps 2.0, Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (WDoranWMF) We should add a short section explaining and linking to the known dumps issues: https://wikitech.wikimedia.org/wiki/User:ArielGlenn/dumps_issues
[12:05:56] <wikibugs>	 Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis)
[12:06:15] <wikibugs>	 Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis) p:Triage→Unbreak!
[12:19:08] <wikibugs>	 Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (dcaro) Probably yes, just had a quick look at the logs and seems a different issue: ` Sep 05 12:17:57 quarry-web-02 uwsgi-quarry-web[1780]: [2023-09-05 12:17:57,240] ERROR in app: Exception on / [GET]...
[12:34:43] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) an-worker1132 seems to be stuck on the Debian install, as seen below. Power cycling the server and retrying the reimage. {F37656999}
[12:35:57] <stevemunene>	 !log power cycle an-worker1132. Host is stuck on debian install after a failed reimage.
[12:35:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:44:59] <wikibugs>	 Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
[13:02:25] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) >>! In T340648#9133002, @elukey wrote: > Hi folks! Yes I'd follow what we did for `analytics-product` etc.. since we'll create the same system user (uid/gid) across nod...
[13:24:50] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team, Machine-Learning-Team, Event-Platform: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) Open→Resolved
[13:29:29] <wikibugs>	 (CR) Mforns: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[13:41:25] <wikibugs>	 Data-Platform-SRE, Data-Catalog: Errors from datahub relating to the search indices - https://phabricator.wikimedia.org/T345616 (BTullis) p:Unbreak!→High I deleted the deployment from codfw and then re-deployed. This ran the `datahub-main-system-update-job` and `datahub-main-nocode-migration-job`...
[13:46:45] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (BTullis) a:BTullis
[13:48:13] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:52:37] <wikibugs>	 Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking)
[13:52:57] <jinxer-wm>	 (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:07:15] <wikibugs>	 (CR) Btullis: Use sudo with git in refinery_deploy_to_hdfs (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[14:09:37] <btullis>	 I am planning to do an eventstreams deploy today, since I have just merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/950194 
[14:15:07] <btullis>	 !log deploying eventstreams-internal for T344688
[14:15:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:15:10] <stashbot>	 T344688: Increase Max Message Size in Kafka Jumbo  - https://phabricator.wikimedia.org/T344688
[14:23:14] <btullis>	 !log deploying eventstreams for T344688
[14:23:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:18] <stashbot>	 T344688: Increase Max Message Size in Kafka Jumbo  - https://phabricator.wikimedia.org/T344688
[14:26:47] <btullis>	 !log completed eventstreams and eventstreams-internal deployments.
[14:26:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:34:51] <wikibugs>	 Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (Jclark-ctr)
[14:36:22] <wikibugs>	 Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) ` 2023-08-27 00:34:10  Disk 3 in Backplane 1 of Storage Controller in SL 3 is removed.`  I can assure you that it wasn't physically removed. Local time, that was a Sa...
[14:38:01] <wikibugs>	 Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > If you're currently rolling your own yarn assembly file, can't you simply upload it to the cluster anyway, using your sudo -u hdfs privileges?...
[14:38:44] <wikibugs>	 Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Jhancock.wm) a:Jhancock.wm
[14:41:18] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Product-Analytics, Wmfdata-Python, Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis)
[14:59:45] <wikibugs>	 Data-Platform-SRE, SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Vgutierrez) I don't think so as it's still using role `insetup::search_platform` but @bking and @RKemper should have more context about it
[15:06:48] <wikibugs>	 Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo) > Given this conversation, I agree this should be a separate task. Will open it and CC you. I should not comment before coffee.  I had already op...
[15:07:31] <wikibugs>	 Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (xcollazo)
[15:09:40] <wikibugs>	 Data-Platform-SRE, SDC General, Wikidata, Wikidata-Query-Service: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 (bking) Hello, I've fixed wdqs1003 as well and it appears we have [[ https://grafana.wikimedia.org/d/000000489/wikidata-...
[15:24:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:12] <wikibugs>	 Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (OSefu-WMF) @BTullis - Looks like that fixed it! Thanks.
[15:37:44] <wikibugs>	 Data-Platform-SRE, SDC General, Wikidata, Wikidata-Query-Service: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 (Gehel) Open→Resolved a:Gehel
[15:43:59] <wikibugs>	 Data-Platform-SRE: an-worker1145: soft lockup. - https://phabricator.wikimedia.org/T345413 (Gehel) Open→Resolved Server has been behaving correctly since last reboot, let's close.
[15:57:11] <wikibugs>	 Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (BTullis) Open→Stalled p:High→Medium There is ongoing discussion about the possible deprecation of Hue happening [[https://wikimedia.slack.com/archives/CLKDS4MG9/p169272831...
[16:20:31] <wikibugs>	 Data-Engineering, All-and-every-Wikisource, ArticlePlaceholder, BetaFeatures, and 55 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[16:30:35] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Product-Analytics: Conda analytics environments breakage - https://phabricator.wikimedia.org/T343823 (mpopov) I have a radical idea to not use conda-analytics for R //at all//. What's the most important thing about conda-analytics? The fact that environments can b...
[16:39:02] <wikibugs>	 (PS1) Btullis: Increase the max kafka message size for gobblin [analytics/refinery] - https://gerrit.wikimedia.org/r/954968 (https://phabricator.wikimedia.org/T307959)
[18:02:53] <wikibugs>	 (CR) Milimetric: [C: +1] "LGTM, just a thought on testing" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/953198 (https://phabricator.wikimedia.org/T344071) (owner: Joal)
[18:03:50] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Manuel) As Wikidata's Analytics Product Manager I am not focused on the technical engineering aspects. But let me still try to provide some context that might be helpful: * The gene...
[18:13:47] <wikibugs>	 (Abandoned) Clare Ming: Add Metrics Platform fragments by entity, platform [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:17:14] <wikibugs>	 (PS9) Clare Ming: Add Metrics Platform fragments by platform only [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557)
[18:19:44] <wikibugs>	 (CR) Clare Ming: Add Metrics Platform fragments by platform only (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[18:55:58] <wikibugs>	 (CR) Peter Fischer: "This change is ready for review." [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[18:57:18] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) >>! In T344688#9141474, @elukey wrote:  >> If I understand it correctly, t...
[19:02:45] <wikibugs>	 (PS3) Peter Fischer: Reuse existing schema fragments for redirects. [schemas/event/primary] - https://gerrit.wikimedia.org/r/951829 (https://phabricator.wikimedia.org/T325315)
[19:09:24] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (gmodena) >>! In T344688#9128543, @gmodena wrote: > @BTullis @elukey  re the MW cont...
[19:52:27] <wikibugs>	 Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (AKanji-WMF)
[20:45:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:52] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:26] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:52] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bo...
[21:23:38] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (bking)
[21:24:05] <wikibugs>	 Data-Platform-SRE, Discovery-Search: Find/fix logstash logging for rdf-streaming-updater - https://phabricator.wikimedia.org/T345668 (bking)
[22:22:39] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookwo...
[23:13:34] <wikibugs>	 Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (colewhite)
[23:44:39] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk2001.codfw.wmnet` - flink-zk200...