[00:09:28] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10SWinxy) Superset is uh... interesting. It doesn't seem like the charts are properly saved. Opening up created charts shows that both the result and the query empty. Sav... [00:16:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:10] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:28] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:36] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:43] (SystemdUnitFailed) firing: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:28] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[04:36:42] (SystemdUnitFailed) resolved: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:56] PROBLEM - Host an-worker1125 is DOWN: PING CRITICAL - Packet loss = 100% [06:54:58] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [07:16:31] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [07:20:01] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [08:20:20] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [08:20:56] (03PS1) 10TChin: Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) [08:21:30] (03CR) 10CI reject: [V: 04-1] Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: 10TChin) [08:26:54] (03PS2) 10TChin: Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) [08:27:21] (03CR) 10CI reject: [V: 04-1] Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: 10TChin) [08:37:10] !log rerun druid_load_webrequest_sampled_128_daily [08:37:10] Schedule: @daily info Next Run: 2023-05-25, 00:00:00 [08:37:12] 
Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:37:43] !log rerun druid_load_webrequest_sampled_128_daily 2023-05-20 to reload missing hour (T337088) [08:37:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:37:47] T337088: Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 [08:48:18] joal: Here's an interesting thing. As an experiment, I started a nodemanager in the test cluster (an-test-worker1002) with both the spark2 and spark3 shuffler jars in the classpath. I was expecting it to fail, but it didn't. It runs successfully. [08:48:24] https://usercontent.irccloud-cdn.com/file/D7DIM9ep/image.png [08:48:50] ok btullis - I wonder which one spark jobs can connect to [08:48:52] I'm not really sure what to do with this information, but thought you might be interested. [08:50:19] I think shuffle services use port 7337 - possibly they'll compete reading the data, or only one of them will have it? [08:51:19] These are some of the startup log messages that mention both shufflers starting up, and that definitely mention port 7337 as you suggest. [08:51:24] https://www.irccloud.com/pastebin/8dvxVoeo/ [08:52:14] It feels like only one of them is running, but I'm not sure :) [08:52:33] Anyway, I'll probably undo it so that we know we're testing the spark3 shuffler, but thought you might be interested in the behaviour. Let me know if you have anything else you'd like me to do. [08:52:59] btullis: Have you tested jobs that actually touch data and use the shuffler? [08:53:24] joal: No, not specifically. [08:53:49] That'd be interesting - I can run the jobs if you wish [08:55:23] I'm going to exclude an-test-worker1001 from yarn because of the issue with the missing hive symlinks. I'm also not sure if I've enabled the spark3 shuffler on an-test-worker1003 yet.
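On the port 7337 question above, here is a minimal sketch of one generic reason why only one of two services might end up serving: a second plain TCP bind to a port that already has a listener is refused by the OS. This is an illustration only. It says nothing about how the YARN nodemanager actually loads the spark2 and spark3 shufflers, and the log does not establish whether both tried to bind. The function name `second_bind_fails` is hypothetical, and an ephemeral port is used instead of 7337 so the sketch is safe to run anywhere.

```python
import socket


def second_bind_fails(host: str = "127.0.0.1") -> bool:
    """Return True if a second socket cannot bind a port that is already listening."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for a free ephemeral port (stand-in for 7337).
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # No SO_REUSEADDR / SO_REUSEPORT is set, so this bind is expected to be refused.
            second.bind((host, port))
        except OSError:
            return True
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("second bind refused:", second_bind_fails())
```

If a second service really did fail to bind, one would expect an "address already in use" style error in the nodemanager log; its absence would suggest something else is going on, such as both jars resolving to a single service instance.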
[08:57:05] ack - I'll wait for you to have the test-cluster ready with the new shuffler installed everywhere, and then I'll run a job [09:20:28] 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10elukey) [09:27:35] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [09:33:47] joal: Good to go if you want to test the spark3 shuffler service. The nodemanagers on an-test-worker100[2-3] are running the spark3 shuffler only. Logs are at `/var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1002.log` (same on 3) [09:34:12] an-test-worker1001 is excluded from yarn for now. [09:40:22] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (10BTullis) I have discussed the timing of this with the #data-engineeri... [09:41:41] ack btullis - testing [09:44:50] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A): eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10gmodena) a:03gmodena [09:47:46] (03CR) 10Peter Fischer: Encode redirect targets in page change events. (032 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [09:48:55] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Stuartyeates) Superset worked for me a few days ago, but now I'm just getting https://superset.wmcloud.org/superset/sqllab/ > An error occurred while storing the lates...
[09:56:30] btullis: my query has run successfully using the new shuffler, all good - However I see errors in the logs, related to container-execution [09:58:35] I think those errors are related to executors being killed after a period of inactivity [10:00:47] joal: Ack. Could that have been when I restarted the nodemanagers earlier? On an-test-worker1002 I did it at: 09:02:02 UTC [10:03:33] joal: Oh I see. stacktraces like this? `2023-05-25 09:58:45,173 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. ` [10:05:51] https://www.irccloud.com/pastebin/PnHntAV8/ [10:06:12] joal: That kind of thing? [10:08:31] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Rebuild hive-hcatalog package for bullseye to address missing symlinks - https://phabricator.wikimedia.org/T337465 (10BTullis) [10:16:58] 10Data-Engineering-Planning, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) I have created {T337465} to address rebuilding the hive packages for bullseye and improving the bu... [10:27:35] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10Stevemunene) Also experiencing SQLAlchemy errors with `jupyterhub-conda.service` on stat1009 which has not been experienced on other hosts including... [10:28:28] joal: Do you think it's still OK for me to push out version 0.0.17 of conda-analytics to prod today? There's no substantive change to the environment, other than the addition of the jar file. [10:32:07] o/ airflow q: when using smth like wmf_props = VariableProperties('wmf_conf'). where is this "wmf_conf" set?
[10:32:53] trying to investigate why some tasks of a dag got "removed" and suspecting a variable to have changed [10:33:02] https://people.wikimedia.org/~dcausse/airflow_task_removed.png [10:33:16] (03CR) 10Jbond: [C: 03+2] Switch to systemd [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [10:33:17] all *_codfw tasks got removed [10:33:20] (03CR) 10Jbond: [C: 03+2] udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) (owner: 10Jbond) [10:35:47] (03Merged) 10jenkins-bot: Switch to systemd [analytics/udplog] - 10https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: 10Majavah) [10:35:49] (03Merged) 10jenkins-bot: udplog: /etc/udp2log should be a folder not a file [analytics/udplog] - 10https://gerrit.wikimedia.org/r/922573 (https://phabricator.wikimedia.org/T276622) (owner: 10Jbond) [10:41:20] (03CR) 10Urbanecm: Add analytics/mediawiki/mentor_dashboard/interaction (034 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/919236 (https://phabricator.wikimedia.org/T325117) (owner: 10Urbanecm) [10:45:32] 10Data-Engineering, 10SRE, 10Shared-Data-Infrastructure, 10serviceops, 10Patch-For-Review: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (10elukey) 05Open→03Resolved a:03elukey [11:08:12] 10Data-Engineering, 10Data Pipelines (Sprint 13): Update Sqoop for externallinks table changes - https://phabricator.wikimedia.org/T335917 (10Antoine_Quhen) 05In progress→03Resolved [11:20:41] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10rook) @SWinxy @Stuartyeates thank yinz for the bug reports. The upgrade earlier in the week appears to have introduced new permissions to the system. Those permissions... 
[11:32:42] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) [11:32:55] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) p:05Triage→03Unbreak! [11:40:35] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) The sqlite database file that is used by jupyterhub is present on the servers and hasn't been modified since the jupyterhub service was fir... [11:41:14] Heads-up, there is an issue with jupyterhub at the moment: T337471 I'm looking at it with high priority. [11:41:14] T337471: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 [11:44:51] 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10JAllemandou) I see no problem to keeping a few days more of the `webrequest_sample_live` data in druid. Related question: Do we wish to remove `webrequest_sampled_128` and only keep t... [11:47:07] btullis: let me know if I can help with the jupyterhub issue [11:47:59] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) The most relevant change that I can see is this one: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/commit/25ea928a0... [11:50:20] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) ` btullis@an-test-client1001:~$ curl -o conda-analytics-0.0.12_amd64.deb https://gitlab.wikimedia.org/repos/data-engineering/conda-analytic...
[11:54:01] It works if I downgrade conda-analytics to version 0.0.12 so I'm proposing to do that for all stats servers. [11:55:38] It will mean removing the iceberg jars from stats servers, which is what 0.0.13 (their current version) brought in. Unfortunately, it updated some other bits of the OS and introduced some kind of incompatibility with jupyterhub. It wasn't uncovered until today, when my unrelated change restarted jupyterhub-conda services. [11:57:22] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) It worked, so I'm going to downgrade conda-analytics to version 0.0.12 on all stats servers. CC @xcollazo Running the following to downlo... [12:04:11] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) Running the following to downgrade conda-analytics: ` btullis@cumin1001:~$ sudo cumin A:stat 'dpkg -i /tmp/conda-analytics-0.0.12_amd64.deb' ` [12:09:53] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure: Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) A temporary downgrading of the conda-analytics environment to version 0.0.12 will have a knock-on effect to the people working on Iceberg t... [12:10:57] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (10Stevemunene) >>! In T336036#8879479, @Stevemunene wrote: > Also experiencing a SQLAlchemy errors with `jupyterhub-conda.service` on stat1009 which has... [12:11:54] (03PS6) 10Peter Fischer: Encode redirect targets in page change events. 
[schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) [12:12:53] (03CR) 10Peter Fischer: "@ottomata, removed link_target.is_external" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [12:13:59] (03CR) 10Ottomata: "Nice. I'd expect to see fragment/mediawiki/state/change/page/1.0.0.yaml rematerialied too?" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: 10TChin) [12:14:16] (03CR) 10Peter Fischer: "@ottomata, interwiki is not a synonym for wiki_id, see comment." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [12:24:34] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: mediawiki-page-content-change-enrichment checkpoints should be stored in Swift - https://phabricator.wikimedia.org/T336656 (10JMeybohm) I think I'm missing something. Is it possible to enable checkpoints without a "HA storag... [12:31:11] (03CR) 10Ottomata: "Thanks, comment about the data model being generic, we need to make sure we don't make something that can only work for redirects." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [12:31:28] !log set "loadByPeriod(P3D+future), dropForever" for webrequest_sampled_live in druid-analytics - T337460 [12:31:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:31:31] T337460: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 [12:32:14] 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10elukey) >>! 
In T337460#8879706, @JAllemandou wrote: > I see no problem to keeping a few days more of the `webrequest_sample_live` data in druid. Thanks! I've set 3 days in the coordin... [12:38:33] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10JArguello-WMF) [12:40:20] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: mediawiki-page-content-change-enrichment checkpoints should be stored in Swift - https://phabricator.wikimedia.org/T336656 (10Ottomata) IIUC, checkpointing without HA allows the JobManager to [[ https://nightlies.apache.org/... [12:43:52] 10Data-Engineering, 10Event-Platform Value Stream, 10Release-Engineering-Team: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (10Ottomata) [12:44:08] 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10Volans) What do you mean by removing it? Making it start after 3 days? It's very useful to be able to check things also few days later and in particular in incident documents is very u... [12:47:43] (03CR) 10Ottomata: Encode redirect targets in page change events. (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/914867 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [13:01:04] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10Ottomata) @tchin for catalog stuff, should we just move https://www.mediawiki.org/wiki/Platform_Engineering_Team/... 
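For readers unfamiliar with the `loadByPeriod(P3D+future), dropForever` shorthand in the !log entry for T337460, here is a minimal sketch of what that rule chain looks like as Druid retention rule JSON. Druid evaluates rules in order and applies the first one that matches a segment, so segments within the last 3 days (or in the future) are loaded and everything else is dropped. The helper name `retention_rules` is hypothetical, and the replica count of 2 on `_default_tier` is an assumption: the value actually set on druid-analytics is not stated in the log.

```python
import json


def retention_rules(period: str, include_future: bool = True) -> list:
    """Build an ordered Druid rule chain: load the recent period, drop the rest."""
    load_rule = {
        "type": "loadByPeriod",
        "period": period,  # ISO 8601 period, e.g. "P3D" for three days
        "includeFuture": include_future,
        # Assumption: 2 replicas on the default tier; the real value is not in the log.
        "tieredReplicants": {"_default_tier": 2},
    }
    # Catch-all: any segment not matched by the load rule above is dropped.
    drop_rule = {"type": "dropForever"}
    return [load_rule, drop_rule]


rules = retention_rules("P3D")
print(json.dumps(rules, indent=2))
```

A list like this is what the Druid coordinator accepts for a datasource's rules (via the coordinator console or its rules API); the sketch only builds the payload and does not contact any cluster.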
[13:05:15] 10Data-Engineering, 10Event-Platform Value Stream: Move wikimedia-event-utilities to gitlab - https://phabricator.wikimedia.org/T337477 (10Ottomata) [13:09:09] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10tchin) [13:09:17] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: mediawiki-page-content-change-enrichment checkpoints should be stored in Swift - https://phabricator.wikimedia.org/T336656 (10JArguello-WMF) [13:11:07] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken with sqlalchemy error - https://phabricator.wikimedia.org/T337471 (10BTullis) p:05Unbreak!→03High Lowering from unbreak now to high priority. We still need to work out how to get the iceberg jar into... [13:11:39] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) [13:13:13] (03PS3) 10TChin: Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) [13:13:38] (03CR) 10CI reject: [V: 04-1] Remove is_registered field from user entity fragment [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/923253 (https://phabricator.wikimedia.org/T337395) (owner: 10TChin) [13:14:32] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10JArguello-WMF) a:03Ottomata [13:16:22] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce alert noise associated with individual users' jupyterhub-singleuser services - https://phabricator.wikimedia.org/T336951 (10BTullis) 
The change was deployed, then jupyterhub broke. The change was reverted immediately, but jupyterhub re... [13:24:46] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) Here is where we pushed out conda-analytics 0.0.13 {T335721} [13:33:52] 10Data-Engineering: Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 (10JAllemandou) I have tried rerunning the loading job but this has not solved the issue. More investigation is needed. [13:37:46] 10Data-Engineering: Increase webrequest_sampled_live Druid datasource's retention - https://phabricator.wikimedia.org/T337460 (10elukey) I think that Joseph meant to suggest if we want to have only 30 days of `webrequest_sampled_live`, and deprecated `webrequest_sampled_128` (this is my understanding after readi... [13:42:04] !log rerun webrequest-refine job for 2023-05-20T00 - we're missing data [13:42:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:39] 10Data-Engineering: Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 (10JAllemandou) [13:51:50] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10xcollazo) For the time being, for `pyspark` jobs in stat servers you can create sessions like this: ` spark = wmfdata.spark.create_... [14:03:40] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10Ottomata) Oh, foof, we will have to rematerialize [[ https://schema.wikimedia.org/#!//primary/jsonschema/mediaw... 
[14:04:43] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 A), 10Patch-For-Review: Remove user is_registered field from mediawiki/page/change schema - https://phabricator.wikimedia.org/T337395 (10Ottomata) [14:10:08] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10xcollazo) >>! In T337471#8879708, @BTullis wrote: > The most relevant change that I can see is this one: https://gitlab.wikimedia.o... [14:18:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:41] ^ is a real mw outage [14:31:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:31] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) All 4 sections are being sanitized now. 
[15:24:40] 10Data-Engineering, 10Event-Platform Value Stream: Move eventutiltities-python repo into main wikimedia-eventutilities repository - https://phabricator.wikimedia.org/T337491 (10Ottomata) [15:54:37] 10Data-Engineering, 10Event-Platform Value Stream: Flink-app helmfile deployments should ensure release and Flink job_name are equivalent - https://phabricator.wikimedia.org/T337496 (10Ottomata) [15:54:59] 10Data-Engineering, 10Event-Platform Value Stream: Flink-app helmfile deployments should ensure release and Flink job_name are equivalent - https://phabricator.wikimedia.org/T337496 (10Ottomata) [15:55:01] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [15:56:13] 10Data-Engineering, 10Event-Platform Value Stream: Flink-app helmfile deployments should ensure release and Flink job_name are equivalent - https://phabricator.wikimedia.org/T337496 (10Ottomata) [16:00:36] 10Data-Engineering, 10Event-Platform Value Stream: Flink-app helmfile deployments should ensure release and Flink job_name are equivalent - https://phabricator.wikimedia.org/T337496 (10Ottomata) Ah, I wanted to make job_name == release name the default behavior in the flink-app chart, but that's not possible,... 
[16:04:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:42] 10Data-Engineering, 10Event-Platform Value Stream: Flink metrics should use a consistent label for each job - https://phabricator.wikimedia.org/T337496 (10Ottomata) [16:12:08] 10Data-Engineering: Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 (10BTullis) a:03JAllemandou [16:12:29] 10Data-Engineering, 10Data Pipelines (Sprint 13): Druid Webrequest sampled 128 has missing data data for 1 hour - https://phabricator.wikimedia.org/T337088 (10BTullis) [16:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:43] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:17] 10Data-Engineering, 10Event-Platform Value Stream: Flink metrics should use a consistent label for each job - https://phabricator.wikimedia.org/T337496 (10Ottomata) Nope, I don't think that works. Makes sense too. It is possible for a Flink JobManager (even in app deployment mode) to run multiple jobs. So... 
[17:10:50] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) db1155:3317 replicating [17:12:44] 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (10Marostegui) [17:41:47] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (10Ottomata) [17:58:57] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) @achou Oh, you know, we should probably version this stream. https://wikitech.wikimedia.org/wiki/Event_Platform/Str... [18:00:01] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) [19:08:57] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10SWinxy) I can see graphs again! [19:26:18] 10Data-Engineering, 10Event-Platform Value Stream: Flink metrics should use a consistent label for each job - https://phabricator.wikimedia.org/T337496 (10Ottomata) [19:28:21] 10Data-Engineering, 10Event-Platform Value Stream: Flink metrics should use a consistent label for each job - https://phabricator.wikimedia.org/T337496 (10Ottomata) Okay, after working on the dashboard a bit, I think I'm going to resolve this task. We have `release` which does uniquely identify a FlinkDeploy... 
[19:28:42] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [19:28:51] 10Data-Engineering, 10Event-Platform Value Stream: Flink metrics should use a consistent label for each job - https://phabricator.wikimedia.org/T337496 (10Ottomata) 05Open→03Resolved a:03Ottomata [19:30:03] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [19:30:11] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [19:45:59] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) I was wrong in my guess about the cause of the differences, it seems that it's not related to the update of buster. I did... [19:56:09] 10Quarry: Give quay.io/wikimedia-quarry access to framawiki - https://phabricator.wikimedia.org/T337516 (10Framawiki) [19:57:50] 10Quarry: Give quay.io/wikimedia-quarry access to framawiki - https://phabricator.wikimedia.org/T337516 (10rook) a:03rook [19:58:07] 10Quarry: Give quay.io/wikimedia-quarry access to framawiki - https://phabricator.wikimedia.org/T337516 (10rook) c'est fait! 
[19:58:13] 10Quarry: Give quay.io/wikimedia-quarry access to framawiki - https://phabricator.wikimedia.org/T337516 (10rook) 05Open→03Resolved [20:00:49] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [20:00:56] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) [20:32:37] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Flink App Deployment monitoring - https://phabricator.wikimedia.org/T328925 (10Ottomata) @dcausse @bking @gmodena @tchin I'm feeling good about this [[ https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app | Flink App Dashboard ]] (prev... [21:08:42] 10Data-Engineering, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_req... [21:08:53] 10Data-Engineering, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10CodeReviewBot) [21:19:19] 10Data-Engineering, 10Data-Platform-SRE, 10Patch-For-Review, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_req... [21:39:28] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Stuartyeates) My query works again. Thank you, rook. 
[22:09:07] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) I've released version 0.0.18 of conda-analytics. Next I'm installing it to an-test-client1002 and I'll do another diff of... [22:16:43] (SystemdUnitFailed) firing: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:24] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) OK, this looks much better. Some different build numbers, but the only python package difference between version 0.0.12 an... [22:19:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:24] 10Quarry, 10cloud-services-team (FY2022/2023-Q4): Move Quarry to be an installation of Superset - https://phabricator.wikimedia.org/T169452 (10SWinxy) [22:25:50] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) Pulled the package to the apt server: ` btullis@apt1001:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/... 
[22:28:13] (DiskSpace) firing: Disk space stat1008:9100:/ 5.672% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1008 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:31:33] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:43] (SystemdUnitFailed) resolved: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:57] (SystemdUnitFailed) firing: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:50] 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure (Q4 Wrap up): Jupyterhub is broken in conda-analytics 0.0.13 - https://phabricator.wikimedia.org/T337471 (10BTullis) I've restarted the jupyterhub-conda service on stat1008 and connected successfully, so I think this is now fixed. [22:36:43] (SystemdUnitFailed) resolved: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed