[04:47:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:48] Data-Engineering, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Movement-Insights, and 3 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (kai.nissen)
[09:14:48] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye
[09:30:20] Data-Platform-SRE, ops-codfw: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (Vgutierrez)
[09:42:58] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye executed with errors: - an-worker1129 (**FAIL**) - Downtimed on Ic...
[09:44:42] (SystemdUnitFailed) firing: hadoop-hdfs-datanode.service Failed on an-worker1129:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:42] (SystemdUnitFailed) resolved: hadoop-hdfs-datanode.service Failed on an-worker1129:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:08:01] Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (BTullis)
[10:08:18] Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (Stevemunene) >>! In T332570#9139479, @ops-monitoring-bot wrote: > Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye executed with errors:...
[10:10:49] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) Resolved→Open
[10:13:41] Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (BTullis) p:Triage→Unbreak! Thanks to @JMeybohm and @fgiunchedi for reporting the issue. I will look into this issue with high priority.
[10:23:54] stevemunene: I'm still with Oscar, I won't be able to join the incident review
[10:24:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:32] PROBLEM - Presto Server on an-presto1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[10:29:23] Ack gehel say Hi to Oscar
[10:49:16] RECOVERY - Presto Server on an-presto1002 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[10:54:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-presto1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:51] Analytics-Radar, Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (phuedx) Open→Resolved a:phuedx Being **bol...
[11:06:51] Data-Engineering, Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (Antoine_Quhen) 1 metric that could have been useful was the number of task retries.
[11:26:29] Analytics-Radar, Data-Engineering, MediaWiki-extensions-EventLogging, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (daniel) It would be nice to resolve the underlying des...
[11:35:12] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) I am going to reimage an-presto1002 to see if we can isolate whether or not it is hardware or software related. There is no particular state of interest on the server, other than log files. ` btullis@a...
[11:35:45] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye
[11:35:46] (CR) Gmodena: [C: +1] cirrussearch: add fetch_failure schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[12:03:59] Data-Engineering, Product-Analytics, Wmfdata-Python, Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) Hi @nshahquinn-wmf - I do know how we can make this work and I think we have a patch ready to go. However...
[12:49:01] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (BTullis) > Should we also enable compression on jumbo? Or would you rather our prod...
[12:52:23] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye completed: - an-presto1002 (**WARN**) - Downtimed on Icinga/Alertmanag...
[12:55:59] btullis: I'm hijacking our 1:1 tomorrow to have a chat with Balthazar on a first task
[13:02:42] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:15] (CR) Phuedx: [C: -1] "-1 for your attention only." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/951191 (https://phabricator.wikimedia.org/T343557) (owner: Clare Ming)
[13:10:24] gehel - Absolutely fine by me.
[13:13:07] Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043 (Novem_Linguae) Had my issue happen again. As a reminder, my issue is that on the first visit to Quarry in that browsing session, I usually get an error 500. One refresh always fixes it. Should I file...
[13:49:51] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) The reimage was fine, but now we are back to having a missing python interpreter. ` btullis@an-presto1002:~$ systemctl list-units --state failed UNIT LOAD ACTIVE SUB DESCRIPTIO...
[13:52:22] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Begrudgingly adding it manually to this server to fix the immediate issue, but we should come back to this. ` btullis@an-presto1002:~$ sudo apt install python-is-python3 Reading package lists... Done B...
[13:52:42] (SystemdUnitFailed) firing: (2) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:53:06] Data-Platform-SRE: Investigate an-presto1002 failures - https://phabricator.wikimedia.org/T344808 (BTullis) Moved to Blocked/Waiting whilst we monitor for stability.
[14:11:09] (PS12) Phuedx: Add analytics/metrics_platform/{app,web}/{click,view} schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[14:16:38] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (BTullis) I have prepared a patch to update the eventstream configuration to consume...
[14:22:33] Data-Platform-SRE, SDC General, Wikidata, Wikidata-Query-Service: Some servers for the Commons query service (WCQS) are missing data - https://phabricator.wikimedia.org/T344882 (Gehel)
[14:44:06] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data...
[14:49:50] (PS2) Aqu: WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862)
[14:56:15] (CR) CI reject: [V: -1] WIP: Create a job to dump XML/SQL MW history files to HDFS [analytics/refinery/source] - https://gerrit.wikimedia.org/r/938941 (https://phabricator.wikimedia.org/T335862) (owner: Aqu)
[14:58:39] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 (Gehel) It looks like there is an issue with the SSL cert on data.nlg.gr. Also, this server seems to requi...
[15:01:19] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (Gehel) Open→Resolved
[15:01:33] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Move whitelist.txt from WDQS deploy repo into puppet and rename it to "allow list" - https://phabricator.wikimedia.org/T343856 (Gehel) Tests on T337296 indicate that this change is successful. There are other issues in T337296...
[15:07:21] Data-Platform-SRE: DataHub staging MAE consumer is spamming logstash - https://phabricator.wikimedia.org/T345550 (BTullis) p:Unbreak!→High I've been unable to replicate this so far. Once the misbehaving pod was killed it was automatically restarted, but the replacement pod worked without issue. Wha...
[15:13:36] Data-Platform-SRE: Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (BTullis) This is now looking good. After making progress with {T305874} and being able to test in staging, it would appear that //just-in-time// provisioning of user accounts with their CN i...
[15:16:55] Data-Platform-SRE, Data-Catalog: DataHub rights assignment is case-sensitive - https://phabricator.wikimedia.org/T309382 (BTullis) This is looking good for closure, once we can promote the change in {T305874} to production. The users will log into the https://idp.wikimedia.org using their wikitech usern...
[15:21:24] Data-Platform-SRE: Rolling operation cookbook: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (Gehel)
[15:21:28] Data-Platform-SRE: Rolling operation cookbook: Detect and remove failed index aliases - https://phabricator.wikimedia.org/T345449 (Gehel)
[15:50:45] Data-Platform-SRE: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) >>! In T344910#9135594, @xcollazo wrote: >> I have created a patch to build docker images of Spark version 3.3.3 > As of this writing, `pyspark=3....
[16:25:09] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I went through comparing the permissions of the `sql_lab` role and the `WMF Analyst` role and I found one discrepancy that was not mentioned in the [[...
[17:52:42] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:52:42] (SystemdUnitFailed) firing: drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:49] I just noticed the nice circle annotations at https://stats.wikimedia.org/#/all-projects/reading/total-page-views/normal|bar|all|~total|monthly
[23:15:23] Are those new? I vaguely recall these used to be annotated through letters A-Z in the corner instead. I love these :)
[23:15:43] The arrow is cute.