[00:02:26] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-fetchimage.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:32] Data-Engineering: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (JAllemandou)
[09:33:31] Hi btullis - I'm receiving alerts from the test cluster (oozie job failures) - Have we made any change on this lately?
[09:35:15] joal: Nothing as far as I'm aware.
[09:35:28] ack btullis - I'll take a look
[09:36:17] also btullis, it's the second day we're facing silent-tasks-failure on airflow - I'll try to look at logs/disk-space, but your view would be welcome
[09:36:18] joal: Are you more concerned about the fact that alerts are generated at all on the test cluster, or by the jobs on the test cluster failing?
[09:36:44] btullis: No concern about the test cluster - it's alerts that come to my box, and I could do without them
[09:37:25] joal: ack on both of those points. Are we expecting email alerts from Airflow, but didn't get them?
[09:41:41] hello folks, I am testing the kafka reboot cookbook on kafka test, as FYI
[09:41:53] absolutely btullis - this is a known issue (I think we have a task, I'm looking for it)
[09:41:54] not sure if it is related to your errors joal
[09:42:16] elukey: my oozie errors could be related indeed - thanks for letting me know :)
[09:51:25] btullis: Thanks for the notice on the error. I will prepare a patch. I suppose it is a problem of escaping.
[09:58:27] aqu: Yes, I thought I'd bring it to your attention straight away, but I haven't begun to look much at a solution. This might help: https://www.freedesktop.org/software/systemd/man/systemd.service.html#Command%20lines
[10:03:45] !log Restart failed airflow tasks
[10:03:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:09:52] btullis: no log in failed airflow tasks
[10:14:16] I see that we have some logs from `airflow-scheduler` including stack traces in `/var/log/syslog` on an-launcher1002. Like this:
[10:14:21] https://www.irccloud.com/pastebin/zDeXM3hr/
[10:14:51] I would have thought that we want to have these going to their own log files, rather than just the catch-all syslog.
[10:15:24] yeah, I'm looking for logs on an-launcher1002, and can't find them easily
[10:16:28] I think you should be able to do `sudo journalctl -u airflow-scheduler` and similar.
[10:17:18] Sorry that should read: `sudo journalctl -u airflow-scheduler@analytics.service` (tab complete helps too)
[10:18:05] yup
[10:18:19] I see we have a few like this: `MySQLdb._exceptions.OperationalError: (1040, 'Too many connections')`
[10:20:27] ok, there were errors yesterday at 17:00UTC, and then again at 00:00UTC today: MySQLdb._exceptions.OperationalError: (1040, 'Too many connections')
[10:20:55] this feels exactly the type of problems that could lead to silent error :(
[10:21:43] I assume we're gonna wait for the upgrade, and see how postgres behaves before taking any action
[10:21:50] btullis: --^ ?
[10:23:13] I'm checking to see what might have caused the too many connections errors: Here are some stats from around 17:00 yesterday
[10:23:14] https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1&from=1671036742202&to=1671038162325
[10:23:44] I wonder if it could be related to aqu jobs (many failures in logs related to this)
[10:24:38] Sudden jump to 1.5K qps at around 17:06
[10:24:41] https://usercontent.irccloud-cdn.com/file/3vDNBOxZ/image.png
[10:27:13] There are connection problems from the mysql perspective at both 17:00 and 00:00 (last 24h)
[10:27:22] In order to mitigate it we might lift the max connections in MariaDB whilst we proceed with the PostgreSQL migration.
[10:28:27] While I see how this would mitigate our issue, I have no clue as to why this happens now and not before, and if changing the parameter could impact other things negatively
[10:30:58] No, I agree. We need to find the cause of the increased MariaDB traffic.
[10:31:59] At least I was on leave yesterday, so I don't think it was me :-)
[10:32:16] :)
[10:32:48] aqu: may I put your airflow in pause while it's not yet ok?
[10:34:15] @joal: Sure
[10:34:26] ack aqu - doing it now
[10:43:33] joal: There were some substantial presto queries going on at around 17:00 yesterday, Can't see how this would affect the MariaDB QPS though: https://grafana-rw.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1671033376026&to=1671041056026
[10:44:18] interesting
[10:44:28] Oh right, yes it could, if it was querying the metastore, right?
[10:44:31] btullis: I think presto connects to mysql for hive metastore
[10:44:34] yup
[10:50:35] OK, we might be on to something here then. That doesn't explain the spike of QPS just after midnight, but I'd guess that this is more likely related to standard spark/hive pipelines hitting the Hive metastore too.
[10:55:56] It looks to me like we've been very close to the max_connections value of 250 for a long time: https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1&from=now-30d&to=now&viewPanel=9&refresh=1m
[10:56:32] We've been creeping nearer to it over the past 7 days, then these two bursts of connections just pushed it over the 250 limit.
[10:59:12] https://usercontent.irccloud-cdn.com/file/A6Nzeh6n/image.png
[11:00:40] So I think that increasing the max_connections value is a good idea.
[11:01:33] It will require more RAM on an-coord1001 but looking at this shows that quite a bit is cached. We can effectively earmark some more of this for MariaDB connections: https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&viewPanel=4
[11:03:16] ack btullis, makes sense - thanks for troubleshooting :)
[11:03:44] Data-Engineering, CirrusSearch, Discovery-Search, Event-Platform Value Stream: EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (dcausse)
[11:07:09] joal: I'll make a ticket now. Changing max connections is possible at runtime so we can do it without restarting, but I'll also create a puppet CR to change it in the config: https://github.com/wikimedia/puppet/blob/production/modules/profile/templates/analytics/database/meta/analytics-meta.my.cnf.erb#L61
[11:41:17] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis)
[11:46:03] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) p:Triage→High
[11:51:31] Data-Engineering-Planning: Airflow does not send SLA emails nor update SLA misses in the db - https://phabricator.wikimedia.org/T314181 (BTullis) We have been struck by this issue again. Jobs were failing due to MySQL connection errors, but no emails were sent. {T325278}
[12:00:09] (PS6) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[12:11:05] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) I believe that it is safe to increase this limit from 250 to 350, based on the amount of c...
[12:13:01] Log increasing max_connections for mariadb on an-coord1001 from 250 to 350 (T325278)
[12:13:01] T325278: Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278
[12:13:47] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) Increased the value at runtime: ` btullis@an-coord1001:~$ sudo mysql Welcome to the MariaD...
[12:13:55] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) Last set of updates required to the affiliate tenure: Per team consensus, `total_affiliate_tenure` is split into two metrics - `affiliate_tenure_max`: MAX of affiliate tenure of affilia...
[12:14:47] (CR) Milimetric: [V: +2 C: +2] GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370 (owner: Nmaphophe)
[12:17:35] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) Confirmed that the increase in `max_connections` is visible in Grafana too. {F35865770,wid...
[12:17:52] Log increasing max_connections for mariadb on an-coord1002 from 250 to 350 (T325278)
[12:18:39] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) Changed the value on an-coord1002 ` btullis@an-coord1002:~$ sudo mysql Welcome to the Mari...
[12:25:53] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Increase max_connections for MariaDB on an-coord hosts - https://phabricator.wikimedia.org/T325278 (BTullis) Merged and applied the puppet config change on both an-coord100[1-2] ` Notice: /Stage[main...
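Editor's sketch of the reasoning in the max_connections thread above: a limit of 250 was nearly saturated, so short bursts tipped it over, and raising it to 350 buys headroom. The helper names and the sample connection count of 245 are hypothetical, chosen only for illustration; the 250 and 350 limits come from the log. `SET GLOBAL max_connections` is the standard MariaDB runtime statement, and it does not survive a restart, which is presumably why the puppet change to the config file was made as well.

```python
def connection_headroom(current: int, max_connections: int) -> float:
    """Return the fraction of the connection limit that is still unused."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 1.0 - current / max_connections


def set_max_connections_sql(new_limit: int) -> str:
    """Build the runtime statement (hypothetical helper, not from the log).

    The change applies immediately but is lost on restart unless the
    server config file is updated too.
    """
    if new_limit <= 0:
        raise ValueError("new_limit must be positive")
    return f"SET GLOBAL max_connections = {int(new_limit)};"


if __name__ == "__main__":
    # 245 is an assumed sample value standing in for "very close to 250".
    for limit in (250, 350):
        print(limit, round(connection_headroom(245, limit), 3))
    print(set_max_connections_sql(350))
```

With the assumed 245 connections, headroom goes from 2% at a limit of 250 to 30% at 350, which is the kind of margin needed to absorb the two bursts discussed above.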
[12:31:21] Data-Engineering-Radar, MW-on-K8s, serviceops, Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Clement_Goubert) You're not, much thanks :D
[12:36:31] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmn...
[12:56:22] btullis: Here is a fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/868397, I've tried it on docker, not perfect but better than shooting in the dark.
[13:03:23] aqu: Looks good at first glance. I'll merge and deploy after lunch.
[13:34:06] Data-Engineering: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (Ottomata) Mind if I hijack this task and we do this for wikimedia-event-utilities at the same time? Oh, there is one more nice thing guava has that we use though: [[ https://gerrit.wikimedia.org/r...
[13:34:54] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet w...
[13:48:45] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis) kafka-stretch1001 worked ok with the new raid config. I'm just going to rebuild kafka-stretch1002 because although t...
[13:51:32] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmn...
[13:53:26] (PS2) Matthias Mullie: Modify SearchPreview action to align with requirements [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868187 (https://phabricator.wikimedia.org/T321069) (owner: Simone Cuomo)
[13:53:43] (CR) Matthias Mullie: [C: -2] "DNM for now; checking whether more work remains to be done" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868187 (https://phabricator.wikimedia.org/T321069) (owner: Simone Cuomo)
[13:54:11] (CR) CI reject: [V: -1] Modify SearchPreview action to align with requirements [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868187 (https://phabricator.wikimedia.org/T321069) (owner: Simone Cuomo)
[13:57:35] (Abandoned) Matthias Mullie: Add click-snippet to searchpreview's action enum [schemas/event/secondary] - https://gerrit.wikimedia.org/r/868095 (https://phabricator.wikimedia.org/T321069) (owner: Matthias Mullie)
[14:06:23] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:37] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet w...
[14:47:05] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis)
[14:54:14] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis) Open→Resolved I think that there are all done now.
[14:58:55] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (BTullis) Now kafka-stretch2001 is the only one of these four kafka-stretch hosts left with the drive order reversed. ` btullis...
[14:59:43] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmn...
[15:43:39] a-team: I am having a little trouble with the interlanguage_navigation data. When I try to query it, I unconditionally get the error `presto error: Partition location does not exist: hdfs://analytics-hadoop/wmf/data/wmf/interlanguage/navigation/daily/date=2022-5-9`. I am querying the presto_analytics_hive DB. Can you advise?
[15:44:16] This error arises even if I try to exclude that date, e.g. by querying only for December 2022.
[15:52:25] Data-Engineering, Event-Platform Value Stream, Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (lbowmaker)
[15:53:36] Data-Engineering, Event-Platform Value Stream, Epic: Deploy to YARN - https://phabricator.wikimedia.org/T325304 (lbowmaker)
[15:53:43] Data-Engineering, Event-Platform Value Stream: Deploy to YARN - https://phabricator.wikimedia.org/T325304 (lbowmaker)
[15:53:46] apine: looking
[15:54:45] Data-Engineering, Event-Platform Value Stream: Deploy to DSE k8s - https://phabricator.wikimedia.org/T325305 (lbowmaker)
[15:55:27] Data-Engineering, Event-Platform Value Stream: Deploy to production k8s - https://phabricator.wikimedia.org/T325307 (lbowmaker)
[15:57:31] milimetric: thank you!
[15:58:53] apine: looks like when we migrated this to airflow we accidentally forgot to pad the month/day, leading to some bad partitions. I'm going to try to clean them up and see if the problem goes away
[16:02:51] milimetric: Thank you! I am actually thinking that I need to run a query over pageview_actor directly, very similar to this one https://github.com/wikimedia/analytics-refinery/blob/master/oozie/interlanguage/daily/interlanguage_navigation.hql. I can't run it in Superset. I assume I need access to the Hive cluster? Can you point me to docs on how to get started there?
[16:03:06] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet w...
[16:03:18] apine: problem with interlanguage_navigation should be fixed now, please try and let me know
[16:04:31] apine: the select part of that query with the parameters filled in should work, let me test superset's sqllab
[16:05:00] milimetric: `interlanguage_navigation` is now working, thank you! The thing that doesn't work on Superset is `parse_url`, it seems
[16:08:19] milimetric: hmm, I was wrong. `interlanguage_navigation` is querying "successfully," but not turning up any results. Sorry for the spam!
[16:14:22] apine: yeah, there were quite a few changes that I had to make because we don't have the Hive UDFs in Presto (normalize_host and the built-in parse_url). So here, this is a version that's close to that query and runs in presto:
[16:14:24] https://www.irccloud.com/pastebin/xKd0aAs2/
[16:14:25] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Ottomata) WOWWW THANK YOU!
[16:14:49] apine: you can fork/play here: https://superset.wikimedia.org/superset/sqllab/?savedQueryId=616
[16:14:56] milimetric: wowwwwwww, fantastic! thank you so much!!!
[16:15:25] np, this kind of stuff can be unnecessarily annoying to start with, but useful once you get started
[16:16:37] (apine note: I'm not at all sure my reproduction is accurate, I left out some filters and didn't check the data at all, but I'm sure you'll be changing the query anyway, so that's just to get you started. It sounds like you have access to query, it was just the functions from Hive that were throwing errors, but let me know otherwise)
[16:20:27] milimetric: Nope, no otherwise; that's all correct! And `url_extract_host` works well for my purposes :). Much gratitude!
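Editor's sketch of the partition-naming issue diagnosed above (month/day not zero-padded, giving `date=2022-5-9`). The function names are hypothetical and this is not the actual Airflow job code; it only contrasts an unpadded format string with `datetime.date.isoformat()`, which always pads to `YYYY-MM-DD`. The base path is the one quoted in the Presto error message.

```python
from datetime import date

BASE = "hdfs://analytics-hadoop/wmf/data/wmf/interlanguage/navigation/daily"


def unpadded_partition(base: str, day: date) -> str:
    """Illustrates the bug class: month and day are not zero-padded."""
    return f"{base}/date={day.year}-{day.month}-{day.day}"


def padded_partition(base: str, day: date) -> str:
    """isoformat() always yields YYYY-MM-DD, so partitions sort and match."""
    return f"{base}/date={day.isoformat()}"


if __name__ == "__main__":
    day = date(2022, 5, 9)
    print(unpadded_partition(BASE, day))
    print(padded_partition(BASE, day))
```

For 9 May 2022 the first form reproduces the `date=2022-5-9` suffix seen in the error, while the second gives `date=2022-05-09`.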
[16:42:01] yay :)
[16:43:55] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (BTullis) OK, I think that both of these two hosts are set up correctly now. The failure in the cookbook above was only a delay...
[16:59:24] (PS7) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[17:03:04] (PS8) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[17:06:04] thanks a lot milimetric for finding the interlanguage issue!
[17:07:25] milimetric: I guess we need to fix our airflow job, right?
[17:15:09] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (Ottomata) YAYYY
[19:01:53] (CR) Milimetric: [V: +2 C: +2] GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/858370 (owner: Nmaphophe)
[19:02:53] joal: no, I don't think so, someone fixed it in early May, so it was just a cleanup issue, should be good going forward
[19:08:25] !log Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance.
[19:08:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:29:15] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1746 MB (3% inode=84%): /tmp 1746 MB (3% inode=84%): /var/tmp 1746 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[20:30:45] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1741 MB (3% inode=84%): /tmp 1741 MB (3% inode=84%): /var/tmp 1741 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[21:36:06] looking into this!
[22:28:32] !log run `sudo apt clean` on an-coord1001
[22:28:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:33:14] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[22:55:44] Data-Engineering, Metrics-Platform-Planning, Platform Engineering, User-Urbanecm: Access to aggregate User Agent statistics - https://phabricator.wikimedia.org/T298912 (Dreamy_Jazz) One thing that I was hoping to achieve in the medium to long term is to normalise the CheckUser table by splitting...