[00:10:59] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 67 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[00:11:21] (PS1) Zabe: Add ckb.wiktionary to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/903827 (https://phabricator.wikimedia.org/T332093)
[01:33:31] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:57:21] (CR) Sharvaniharan: "Looks good to me @Mazevedo and @Tsevener. Sorry for the delay in review. Please feel free to merge. Please let us know when the config for" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[01:57:52] (CR) Sharvaniharan: [C: +1] Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[03:35:12] Data-Engineering: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688 (fkaelin) @MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier. The `.tar.gz` format of...
[03:45:26] (PS1) DLynch: EditAttemptStep: Add a new abort type for page updates [schemas/event/secondary] - https://gerrit.wikimedia.org/r/903845 (https://phabricator.wikimedia.org/T301582)
[05:33:31] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:37:56] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Joe) >>! In T330507#8735183, @Ottomata wrote: > @Joe we disc...
[06:54:55] Data-Engineering: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688 (MGerlach) >>! In T305688#8737195, @fkaelin wrote: > @MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will mak...
[07:03:17] Hi, I was checking how to access the HTML dumps from the stat-machines (/mnt/data/xmldatadumps/public/other/enterprise_html/) and it seems that the latest available snapshot there is 20221001. However, there are more recent snapshots available in the public HTML dumps (such as https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/). would it be possible to make the most recent snapshots of the HTML dumps available from stat?
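Note on the snapshot gap described above: a minimal sketch, in Python, of how the stat-machine copy could be compared with the public listing. It assumes snapshots are named as `YYYYMMDD` strings (as in `20221001` and `20230320` from the message); the function name and the extra snapshot names `20220901` and `20230301` are hypothetical and only there for illustration.

```python
def newer_snapshots(local_names, public_names):
    """Return public snapshot names that are newer than the newest local one.

    Names are assumed to be YYYYMMDD strings, so plain string comparison
    orders them chronologically. Anything else in the listing is ignored.
    """
    def is_snapshot(name):
        return len(name) == 8 and name.isdigit()

    local = sorted(n for n in local_names if is_snapshot(n))
    public = sorted(n for n in public_names if is_snapshot(n))
    if not local:
        return public
    return [n for n in public if n > local[-1]]


# 20221001 and 20230320 come from the chat; the other names are made up.
local = ["20220901", "20221001", "README"]
public = ["20221001", "20230301", "20230320"]
missing = newer_snapshots(local, public)
print(missing)
```

On a stat machine the local list could come from `os.listdir()` on the directory quoted in the message, but the exact layout beneath that directory is not shown in this log, so treat that as an assumption.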
[07:29:18] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[07:36:40] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) Open→Resolved a:ayounsi Thanks again everybody!
[07:46:58] btullis, steve_munene o/
[07:47:22] (PS5) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[07:48:20] I am working on kafka-main nodes, some of them are old and Moritz suggested to just dist-upgrade them, I came up with https://phabricator.wikimedia.org/T332013#8733165 that seems to work really well. I see that kafka-jumbo nodes will be replaced (at least, most of them), so if you want to try the dist-upgrade road lemme know
[07:48:48] moreover, I'd like to go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/903245
[07:48:52] lemme know your thoughts :)
[07:59:21] Morning elukey, having a look
[08:00:08] (PS6) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[08:03:11] (PS7) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[08:31:54] elukey: Morning. Thanks so much for all of the input. I +1d the PKI change.
[08:35:17] I'm normally a big fan of the dist-upgrade method to upgrade Debian in-place, but I'd be a little reticent to go straight to it for kafka-jumbo, unless there's a really compelling reason to do so.
[08:42:10] We've got kafka-jumbo101[0-5] ready to go on bullseye. kafka-jumbo100[1-5] due to be retired, but that would leave kafka-jumbo100[6-9] just /slightly/ different from the other six nodes.
[08:43:11] (CR) Aqu: Migrate refine webrequest to Airflow (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[08:43:29] mgerlach: I will have a look and see what is possible.
[08:45:24] mgerlach: Is this the ticket that's related to your request? T305688
[08:45:24] T305688: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688
[08:52:44] btullis: thanks. T305688 is not related. the HTML dumps used to be accessible from the stat-machines (not hadoop) but it looks like 20221001 is the last snapshot that is available even though the enterprise dumps contain more recent ones.
[08:52:44] T305688: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688
[09:01:46] btullis: sure makes sense, it was just as FYI if you need a battle tested procedure :)
[09:09:29] kafka-jumbo1001 restarted, cluster is recovering, so far all good
[09:10:35] weird, I don't see TLS metrics in https://grafana.wikimedia.org/d/000000253/varnishkafka
[09:11:12] (PS2) Lgaulia: Add first input delay schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012)
[09:13:02] (CR) Lgaulia: Add first input delay schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: Lgaulia)
[09:14:04] I see all metrics like
[09:14:05] rdkafka_producer_broker_txbytes{broker="ssl://kafka-jumbo1001.eqiad.wmnet:9093/1001", client_id="varnishkafka", cluster="cache_text", instance="cp1075:9132", job="varnishkafka", prometheus="ops", site="eqiad", source="configured"}
[09:14:22] source="configured" is weird, IIRC it should be webrequest_text/upload
[09:20:30] something changed, we should probably investigate
[09:33:32] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:08] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[09:44:11] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[09:45:28] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (elukey)
[09:54:48] elukey: Yes I see, that is weird.
[09:56:02] mgerlach: OK, understood. I'm afraid I don't know much about these HTML dumps. I will try to find out more and ask around, but others may well know more than I do about it.
[10:03:37] Analytics-Clusters, Analytics-Kanban: Upgrade Spark to 2.4.x - https://phabricator.wikimedia.org/T222253 (BTullis)
[10:03:53] Analytics, Analytics-Kanban, Data-Engineering: Rebuild spark2 for Debian Buster - https://phabricator.wikimedia.org/T229347 (BTullis) Open→Resolved Hi @jbond - Many thanks for the heads-up. We've been working on the upgraded an-test-worker1001 to try to work out what needs to be adapted in or...
[10:45:21] btullis: thanks
[12:21:43] (PS1) Jennifer Ebe: T330199-Migrate-VirtualPageView-HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/904159
[12:48:37] after switching to conda-analytics and Spark-3, I keep on getting "File save error: Forbidden" on JupyterLab. Please see: https://capture.dropbox.com/UG1qyW9Pc9hTJywD. Is this intended?
[12:49:30] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (Ottomata) FWIW, API platform folk are talking about this for API guidelines now. W...
[12:50:44] (CR) Joal: "Only four small things (mostly naming) - Can you confirm that you have tested the code and that data is the same as the original?" [analytics/refinery] - https://gerrit.wikimedia.org/r/904159 (owner: Jennifer Ebe)
[12:51:36] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) > Is that correct? Correct!
[12:54:14] Heya mforns - I have a question for you :)
[12:55:46] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for gurwiki - https://phabricator.wikimedia.org/T327841 (BTullis) Open→Resolved a:BTullis I believe that this has now been created, including the row in `meta_p.wiki` ` btullis@tools-sgebastion-...
[12:59:44] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (BTullis) @MarcoAurelio - I believe that these seven wikis you mentioned are all present in `meta_p.wiki` now. ` btullis@tools-sgebastion-10:~$ sql meta_p MariaDB [meta_p...
[13:01:16] aarora: You're the first reporting that error, I assume something must be incorrectly configured for you
[13:03:13] the old setup works perfectly for me, even now. I migrated to the new setup on stat1008, where I am facing this error. FYI, I just tested and this happens after executing any cell, and gets resolved only after refreshing the browser window.
[13:04:59] I followed the instructions, and I'm not sure what the error could possibly be. Can you kindly help figure it out?
[13:05:40] Data-Engineering, Event-Platform Value Stream: EventStreamCatalog should not remove user specified options in CREATE TABLE statements - https://phabricator.wikimedia.org/T331542 (tchin) a:tchin
[13:06:03] aarora: I have a hint of a feeling that one person has asked this question before, but it would have been some time ago, when `conda-analytics` was brand new.
[13:06:39] aarora: What do you get if you type `conda env list`?
[13:07:01] aarora: the first thing I'd try would be to restart the jupyterhub environment
[13:07:43] ottomata: Many thanks for re-running refine_event
[13:08:31] okay, so as I stated previously, the issue is not persistent, whenever it happens I restart the browser and everything returns back to normal, but the issue comes back after 5-10 minutes of the restart. I tried restarting the JupyterHub environment, no luck!
[13:08:49] # conda environments:
[13:08:49] #
[13:08:50] 2021-08-03T20.29.11_aarora /home/aarora/.conda/envs/2021-08-03T20.29.11_aarora
[13:08:50] entity_insertion /home/aarora/.conda/envs/entity_insertion
[13:08:50] base * /home/aarora/.conda/envs/orphans
[13:08:50] /home/aarora/anaconda3
[13:08:50] /home/aarora/anaconda3/envs/kerasnew
[13:08:51] /home/aarora/anaconda3/envs/pub_priv
[13:09:50] the orphans environment was created using `conda-analytics-clone orphans` as instructed on https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/conda-analytics
[13:11:29] sorry, I meant "refresh" the browser tab and not "restart"..
[13:13:44] aarora: as it seems browser related, have you tried cleaning cookies?
[13:14:05] indeed..
[13:15:08] also, I don't get why you state it's browser related; IMHO, refreshing the browser just reloads the most recent state of the Jupyter notebooks..
[13:15:11] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (JMeybohm) >>! In T330507#8737310, @Joe wrote: > if that's th...
[13:15:33] Something is odd about the list above, which is that the asterisk is on the same line as the `base` environment, suggesting that this is the environment that is active.
[13:16:18] aarora: makes sense
[13:16:20] When I create a new environment on stat1008 with `conda-analytics-clone btullis` and then activate it with `source conda-analytics-activate btullis` I see the following.
[13:16:44] https://usercontent.irccloud-cdn.com/file/nKfoEJ23/image.png
[13:18:21] I can't see the `/opt/conda-analytics` in your output, so perhaps there was an issue over which base environment was cloned. That's the best guess I have so far.
[13:18:26] well, if I do it on the terminal I also get something similar to what you get: https://capture.dropbox.com/4QVmZYzfDuOhQdNz
[13:18:49] (PS2) Jennifer Ebe: T330199-Migrate-VirtualPageView-HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/904159
[13:18:54] but the output that I shared above was by executing it directly on Jupyter: `!conda env list`
[13:19:08] (CR) Jennifer Ebe: T330199-Migrate-VirtualPageView-HQL (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/904159 (owner: Jennifer Ebe)
[13:19:35] oh yes, even on the terminal I don't see /opt/conda-analytics
[13:19:42] as the base
[13:19:50] what do you suggest I should do?
[13:21:42] I have a local anaconda installed on the stat machines as well, I did it 2 years ago, when there was no support for conda from analytics. Should I delete that and then try again?
[13:23:19] Yes, I would try to start again from a clean base using `conda-analytics` if possible.
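Note on the `conda env list` check discussed above: a minimal Python sketch of the two things being looked for, namely which entry carries the `*` marker and whether any path sits under `/opt/conda-analytics`. The sample text is the output pasted in the chat; the function name and the simple whitespace-based parsing are assumptions for illustration, not part of conda itself.

```python
SAMPLE = """\
# conda environments:
#
2021-08-03T20.29.11_aarora /home/aarora/.conda/envs/2021-08-03T20.29.11_aarora
entity_insertion /home/aarora/.conda/envs/entity_insertion
base * /home/aarora/.conda/envs/orphans
/home/aarora/anaconda3
/home/aarora/anaconda3/envs/kerasnew
/home/aarora/anaconda3/envs/pub_priv
"""


def parse_conda_env_list(text):
    """Parse `conda env list` text into dicts with name, path and active flag."""
    envs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split()
        active = "*" in fields  # the asterisk marks the active environment
        fields = [f for f in fields if f != "*"]
        # A line with a single field is an unnamed environment (path only).
        name = fields[0] if len(fields) > 1 else None
        envs.append({"name": name, "path": fields[-1], "active": active})
    return envs


envs = parse_conda_env_list(SAMPLE)
active = [e for e in envs if e["active"]]
uses_shared_base = any(e["path"].startswith("/opt/conda-analytics") for e in envs)
print(active, uses_shared_base)
```

For the pasted output this flags the entry named `base` pointing at the `orphans` path as active and finds no `/opt/conda-analytics` path, which matches the two oddities raised in the conversation.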
[13:24:05] SSH into stat1008, back up `~/anaconda3` and `~/.conda` and start again with a `conda-analytics-clone orphans`
[13:25:04] You might want to stop your running jupyter server first, if possible. It'll probably get confused if its running conda environment gets deleted anyway.
[13:26:12] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/903827 (https://phabricator.wikimedia.org/T332093) (owner: Zabe)
[13:26:50] btullis: Did you deploy airflow yesterday?
[13:27:03] thanks @btullis! will report with the fresh setup in a bit..
[13:27:49] joal: Oh, no I didn't. I wanted to check on the state of the airflow-dags repo before doing so in a rush at the end of the day. Sorry, should I do it now?
[13:28:11] aarora: Great, hope it helps. Do let us know.
[13:28:15] btullis: there is the possibility of us adding more to the deploy soon - let's wait
[13:28:35] sorry for having been unhelpful aarora - I'm not a python person :)
[13:28:52] joal: Yes, sure thing.
[13:29:14] no worries @joal. Let's see if it works out :)
[13:32:58] @btullis did that, but still got the same error after a while..
[13:33:16] https://capture.dropbox.com/QOJr7HASBsP5bxTD
[13:33:24] (CR) Joal: "Still 3 nits :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/904159 (owner: Jennifer Ebe)
[13:33:32] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:36:03] aarora: Oh, and where are you actually trying to save the file? I'm checking the file system permissions beneath `/home/aarora` on stat1008 in case anything is odd. Does your `conda env list` look like mine now?
[13:37:20] I see it.
[13:37:24] https://www.irccloud.com/pastebin/5MXdtSsw/
[13:37:49] yes, I have the permissions. The saving location is `/home/aarora/orphans`, where the permissions are correct, I own the directory..
[13:37:59] https://capture.dropbox.com/Zypfi8uzTzcKG4tT here's the `conda env list` output
[13:39:28] I don't think the permission error stems from the filesystem, basically, there's some issue with the kernel. I guess it becomes unresponsive or goes to an inconsistent state. All of this eventually boils down to Jupyter or the conda env not behaving properly..
[13:43:23] Yeah, I'm afraid I'm a bit stumped. I'm following what's happening with `journalctl -u jupyter-aarora-singleuser-conda-analytics.service -f`
[13:44:13] It saves the file frequently, because I can see `I 2023-03-29 13:42:42.239 SingleUserNotebookApp handlers:171] Saving file at /orphans/test_spark3.ipynb` but I can also see some `Forbidden` messages and I'm not sure what's generating them.
[13:45:36] indeed, this is exactly what's happening. The file gets saved at times, but from time to time I see those "forbidden" messages, which I presume come when the kernel enters into an inconsistent state, where only "refreshing" the browser tab acts as a solution, and I get the notebook restored at the last save point..
[13:45:59] also, funnily, the kernel remains active at that point, that is, the kernel doesn't get restarted, weird..
[13:55:30] aarora: It looks very similar to this issue, which hasn't been solved: https://github.com/bitnami/charts/issues/7427
[13:56:05] I googled the string: `SingleUserNotebookApp handlers:612] Forbidden`
[13:57:30] @btullis going through the issue now, but I am puzzled what's exactly causing this issue?
[13:58:33] is there a specific version of the JupyterLab or any other dependency? Also, curious if no one else is facing the same issue? If it's a version-specific issue, then everyone should be facing this, correct?
[13:59:54] I don't know either, I'm afraid. Nobody else has complained of this error yet. Does it affect you if you try a different stat server?
[14:01:03] tried on stat1005 as well, the same issue..
[14:03:24] is this warning normal: "Warning: JupyterHub seems to be served over an unsecured HTTP connection. We strongly recommend enabling HTTPS for JupyterHub."? I see this when logging into JupyterHub.
[14:04:19] Yes, that's normal. We don't use HTTPS because your connection is already secured by your SSH tunnel.
[14:04:48] okay..
[14:05:03] Are you using a regular command-line `ssh` to create the tunnel? e.g. ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880
[14:05:16] yes..
[14:05:24] ssh -N stat1005.eqiad.wmnet -L 8100:127.0.0.1:8880
[14:05:35] Hmm, ok thanks.
[14:08:25] can running JupyterHub simultaneously on two stat machines (of course with different ports) be a problem?
[14:12:05] I don't get any errors now after terminating one of the two ssh tunnels. I don't know why this should have been a cause, but just wanted to share this if this can be a potential issue. Thoughts?
[14:13:20] previously, I had two active tunnels, one on the port 8100 for stat1005 and the other on the port 8200 on stat1008. Since the conda environments are local on each machine, and the tunnel is established on different ports, I don't think there should be any issue..
[14:13:24] Data-Engineering: Investigate CPU usage on an-launcher1002 - https://phabricator.wikimedia.org/T308998 (BTullis) I believe that the system load on an-launcher1002 is much lower than it was when this ticket was created. Therefore, I would argue that we should perhaps close the ticket. However, I just checked...
[14:14:25] aarora: I don't think that there should be a problem. I'm afraid I have to be afk for a while now, so I can't look into this any further at the moment. Please feel free to make a ticket with these observations too.
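Note on the two-tunnel observation above: the cause was not established in this conversation. One possible explanation, offered only as a hypothesis, is that browsers scope cookies by hostname and not by port (RFC 6265), so two JupyterHub sessions opened at the same local hostname on ports 8100 and 8200 would share one cookie jar. The Python sketch below builds the tunnel commands in the form quoted in the chat and flags browser hostnames used by more than one tunnel. The helper names and the assumption that both tabs were opened via `localhost` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

REMOTE_PORT = 8880  # remote JupyterHub port from the ssh commands quoted above


def tunnel_command(stat_host, local_port, remote_port=REMOTE_PORT):
    """Build the ssh argv for one tunnel: ssh -N <host> -L <local>:127.0.0.1:<remote>."""
    return ["ssh", "-N", stat_host, "-L", f"{local_port}:127.0.0.1:{remote_port}"]


def shared_cookie_hosts(browser_urls):
    """Return hostnames that appear in more than one URL, ignoring the port."""
    counts = Counter(urlsplit(url).hostname for url in browser_urls)
    return sorted(host for host, n in counts.items() if n > 1)


# Hosts and local ports as described in the chat.
tunnels = {"stat1005.eqiad.wmnet": 8100, "stat1008.eqiad.wmnet": 8200}
commands = [tunnel_command(host, port) for host, port in tunnels.items()]

# Assumption: both tabs were opened via "localhost".
urls = [f"http://localhost:{port}" for port in tunnels.values()]
print(shared_cookie_hosts(urls))
```

If the hypothesis holds, opening one tab via `localhost` and the other via `127.0.0.1` would give the two sessions separate cookies, since the hostnames differ. That is an untested suggestion and worth recording in the ticket alongside the observations.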
[14:18:41] (PS3) Mazevedo: Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481)
[14:19:04] (PS8) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[14:20:00] Data-Engineering, Event-Platform Value Stream, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team: Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (BPirkle) >>! In T332212#8738153, @Ottomata wrote: > FWIW, API platform folk are tal...
[14:22:10] (PS9) Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[14:28:31] thanks @btullis will do..
[14:29:17] (CR) Tsevener: [C: +2] Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[14:29:47] (Merged) jenkins-bot: Add new unified mobile apps schema for Session [schemas/event/secondary] - https://gerrit.wikimedia.org/r/898851 (https://phabricator.wikimedia.org/T331481) (owner: Mazevedo)
[14:36:44] joal: hi! did you have a question?
[14:47:01] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (MarcoAurelio) Hello @BTullis - thank you. I checked for the list of wikis created in 2022 and 2023 as per ` MariaDB [meta_p]> SELECT dbname, lang FROM wiki WHERE dbname...
[15:04:53] (PS3) Jennifer Ebe: T330199-Migrate-VirtualPageView-HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/904159
[15:07:24] (CR) Jennifer Ebe: T330199-Migrate-VirtualPageView-HQL (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/904159 (owner: Jennifer Ebe)
[15:15:39] (CR) Aqu: Migrate refine webrequest to Airflow (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[15:21:07] Analytics-Radar, Data-Engineering-Icebox, Product-Analytics: Migrate all reportupdater queries to hive - https://phabricator.wikimedia.org/T205296 (BTullis) I would think that this ticket should probably be closed now. We're looking to deprecate the `hive` CLI and mapreduce, in favour of either Sp...
[15:22:51] Analytics-Radar, Data-Engineering-Icebox, Product-Analytics: Migrate all reportupdater queries to hive - https://phabricator.wikimedia.org/T205296 (BTullis) Open→Resolved a:BTullis Similarly, we're looking to migrate reportupdater jobs to Airflow as the scheduler: {T307540} I'll go ahead...
[15:24:26] Data-Engineering: Investigate CPU usage on an-launcher1002 - https://phabricator.wikimedia.org/T308998 (BTullis) What do you think @JAllemandou? Should we just close this ticket about the CPU load on an-launcher1002? We already have a task related to the migration of reportupdater jobs to Airflow: {T307540...
[15:24:56] btullis: yw!
[15:31:43] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (BTullis) Open→Resolved Thanks for confirmation @MarcoAurelio > As such I think this Task may be closed as resolved unless there's further investigation that nee...
[15:42:00] Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (aborrero)
[15:43:33] hey mforns - I had a question, but no more, I found my answer :)
[15:44:34] (CR) Joal: [V: +2 C: +2] "LGTM! Merging" [analytics/refinery] - https://gerrit.wikimedia.org/r/904159 (owner: Jennifer Ebe)
[15:44:52] joal: o/ for your awareness - kafka-jumbo1001 is running with a different TLS certificate (not generated by the puppet ca, but from our PKI)
[15:45:08] I checked a lot of things and so far nothing seems to be problematic
[15:45:27] the worst case scenario is a client that accepts only puppet-based TLS certs, trying to connect to it
[15:45:33] but so far I didn't find trace of any
[15:45:54] once we are confident that the change works, we'll flip all other certs to PKI
[15:45:59] to complete the migration
[15:46:09] so if you see anything weird ping me Ben or Steve please :)
[15:47:32] Data-Engineering, WMDE-References-FocusArea, WMDE-TechWish-Sprint-2023-03-14: Shut down our previous Cloud VPS project and create a new one - https://phabricator.wikimedia.org/T332040 (Andrew)
[15:48:34] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata) a:Ottomata
[15:51:58] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey)
[15:52:23] Analytics-Radar, Data-Engineering-Icebox, Machine-Learning-Team: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey)
[15:58:49] Data-Engineering, SRE, SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (BTullis) Circling around to this old problem, if indeed it's still a problem. From what I can see, although the hosts have all been refreshed since the last entry on this ticke...
[15:59:21] ack elukey! Thanks for letting us know :)
[15:59:37] I'm gonna monitor closely our next gobblin runs elukey
[16:00:16] joal: I re-checked and it uses the right ca bundle so it shouldn't be angry at me :D
[16:00:22] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[16:00:29] \o/
[16:00:35] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[16:01:16] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata)
[16:01:25] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata) Open→Resolved
[16:01:28] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[16:02:34] Data-Engineering-Planning, Event-Platform Value Stream, serviceops, Epic: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:04:14] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:04:21] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:04:24] Data-Engineering-Planning, Event-Platform Value Stream, serviceops, Epic: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:04:31] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:04:38] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) Done: {T333464}
[16:04:45] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:05:29] Analytics-Radar, Data-Engineering-Radar, Event-Platform Value Stream: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) Next steps: * Move the remaining nodes to PKI
[16:05:33] Data-Engineering-Planning, SRE-swift-storage, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Ottomata)
[16:05:35] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:06:00] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for gucwiki - https://phabricator.wikimedia.org/T326235 (BTullis) Open→Resolved a:BTullis
[16:14:44] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (jcrespo)
[16:19:08] (DiskSpace) firing: Disk space stat1008:9100:/srv 5.614% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1008 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:22:52] Data-Engineering-Planning, Event-Platform Value Stream, serviceops, Epic, Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:24:54] Data-Engineering-Planning, Event-Platform Value Stream, serviceops, Epic, Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:26:38] Data-Engineering: Investigate CPU usage on an-launcher1002 - https://phabricator.wikimedia.org/T308998 (JAllemandou) Open→Resolved a:JAllemandou We discussed this in standup: the CPU load of the server has gone to an acceptable rate, and we plan to tackle report-updater queries as part of the air...
[16:28:06] Data-Engineering-Planning, serviceops, Epic, Event-Platform Value Stream (Sprint 10), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata)
[16:44:55] (PS2) Joal: Add compute_mediawiki_history_reduced.hql in hql folder [analytics/refinery] - https://gerrit.wikimedia.org/r/901545
[17:32:34] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Cmjohnson) This is showing 6 disks failed. Is it possible there is a different problem that is causing the disks to fail? I do not see any errors for the raid controller
[17:33:31] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:33] Hi SRE folks - Could anyone of you do a quick check of /srv on stat1008 please? We received an alert about disk being almost full
[18:12:00] joal: yep, 97% full
[18:12:09] looking
[18:12:17] Thank you sukhe
[18:33:13] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (xcollazo) > It is a non airflow node & user specific solution. I'm confused; @Htriedman would like to use this from Airflow, so we do need an Airflow operat...
[18:36:03] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JAllemandou) >>! In T317167#8739562, @xcollazo wrote: >> It is a non airflow node & user specific solution. > I'm confused; @Htriedman would like to use thi...
[18:37:53] Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (Cmjohnson) @wiki_willy all 3 of these servers are well out of warranty (2-3 years). analytics1068 is marked failed in netbox
[18:49:08] (DiskSpace) resolved: Disk space stat1008:9100:/srv 5.559% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1008 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:02:05] Data-Engineering-Planning, Data Pipelines: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (xcollazo) >>! In T317167#8739573, @JAllemandou wrote: >>>! In T317167#8739562, @xcollazo wrote: >>> It is a non airflow node & user specific solution. >> I'...
[19:50:48] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for anpwiki - https://phabricator.wikimedia.org/T332458 (BTullis) Open→Resolved a:BTullis
[19:59:26] Analytics-Radar, DC-Ops, SRE, SRE-swift-storage, ops-eqiad: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (wiki_willy) @Jclark-ctr has a few spares onsite, so we can probably use those as replacements. Thanks, Willy
[21:00:35] Data-Engineering: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (BTullis) a:BTullis It's a little over three years since the last update, so I'm revisiting this ticket and I'll try to reach consensus on what to do. We're still talking about 687 GB of data on stat1007...
[21:14:45] (CR) Aqu: "1 open question waiting for test." [analytics/refinery] - https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[21:21:33] Data-Engineering, Event-Platform Value Stream, EventStreams: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (TheresNoTime)
[21:33:34] (SystemdUnitFailed) firing: (7) wmf_auto_restart_envoyproxy.service Failed on an-test-ui1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed