[00:20:57] Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[00:41:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:45:39] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:52:08] (PS1) DLynch: EditAttemptStep: add new values for init_mechanism [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805728 (https://phabricator.wikimedia.org/T298634)
[08:37:06] hello folks
[08:37:48] an-tool1009 seems to be having an issue with apache2, yesterday at around 10:33 UTC the CAS settings got removed
[08:38:13] I don't recall exactly if Hue was working with CAS or not though
[08:38:44] the change applied seems to be https://gerrit.wikimedia.org/r/c/operations/puppet/+/805191
[08:38:54] but I can't find a correlation
[08:39:02] and we have profile::hue::enable_cas: false
[08:48:37] Thanks elukey. I will take a look at it.
[09:24:58] The `auth_cas` module had been automatically enabled by `/var/lib/dpkg/info/libapache2-mod-auth-cas.postinst` after the new CAS-SSO packages were installed recently.
[09:26:58] Puppet runs cleanly now and Apache has restarted, but I can't log in to Hue now.
[09:52:41] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:02:16] (CR) Kosta Harlan: [C: +2] Add other_reason action_data to image_suggestion_interaction and link_suggestion_interaction schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805418 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[10:02:53] (Merged) jenkins-bot: Add other_reason action_data to image_suggestion_interaction and link_suggestion_interaction schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805418 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[10:29:40] btullis: I am a little puzzled since I saw in the puppet logs that CAS-related settings got removed, IIRC we enabled a plugin in Hue to pick the username from a CAS environment variable set by mod_cas
[10:30:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/628741
[10:30:37] Yeah, I'm working with moritz.m on this right now.
It's like `enable_cas` just got switched overnight and the templates have been swapped out here: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/hue.pp#L138-L155
[10:31:21] I also recall https://gerrit.wikimedia.org/r/c/operations/puppet/+/678860/2/hieradata/hosts/an-tool1009.yaml
[10:31:29] there were some extra params
[10:31:42] but probably they ended up in the main role config
[10:32:23] ah yes yes
[10:50:34] Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (BTullis)
[10:51:01] Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (BTullis) p:Triage→Medium
[11:17:32] Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, Patch-For-Review: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (MoritzMuehlenhoff) Ben and myself did some debugging: While we had been using CAS for Hue for the last two years, it was ne...
[11:18:58] elukey: This is fixed now. We changed `profile::hue::enable_cas:` to `true`
[12:05:32] (CR) Joal: [WIP] Add projectview hql scripts to analytics/refinery/hql path. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[12:07:29] (CR) Joal: [V: +2 C: +2] "LGTM!
Merging and adding to deployment list" [analytics/refinery] - https://gerrit.wikimedia.org/r/805446 (https://phabricator.wikimedia.org/T309987) (owner: Milimetric)
[13:09:53] btullis: weird :(
[13:28:55] Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform: Upgrade event platform related VMs in deployment-prep to Debian bullsye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata) Open→Resolved
[13:37:16] Data-Engineering, Data-Engineering-Kanban, Event-Platform, EventStreams, Patch-For-Review: Expose mediawiki/revision/tags-change in stream.wikimedia.org - https://phabricator.wikimedia.org/T294391 (Ottomata) a:Ottomata
[13:48:22] Data-Engineering, Data-Engineering-Kanban, Event-Platform, EventStreams, Patch-For-Review: Expose mediawiki/revision/tags-change in stream.wikimedia.org - https://phabricator.wikimedia.org/T294391 (Ottomata) Getting a strange error when trying to deploy: ` command "/usr/bin/helm3" exited wi...
[13:49:06] Data-Engineering, Event-Platform, Generated Data Platform: Add Event Platform timestamp JSONSchema -> Flink type support - https://phabricator.wikimedia.org/T310495 (Ottomata)
[13:51:49] btullis: are we doing the DataHub upgrade?
[13:52:08] Yeah, I was just about to ping you as well. :-)
[13:52:24] Just researching how the index rebuild might be done.
[13:52:27] * milimetric checks telepathy booster
[13:52:44] k, I can join in cave if you want a partner
[13:57:02] Mmm. They build a special jar for it, which we don't do at the moment: https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/Dockerfile#L16
[13:57:02] I wonder if I can just build it on a stat box and run the command from there?
[13:57:52] OK, give me a few minutes and then let's coordinate in the cave? How did the scheduled run go last night, I've not looked?
[13:58:50] failed for a silly path mistake (had leftovers from when the job was in analytics_test)
[13:58:57] so I was going to try it again after the upgrade
[13:59:12] OK, cool.
[14:09:12] (PS4) Milimetric: Add datahub metadata ingestion CLI as a conda env [analytics/refinery] - https://gerrit.wikimedia.org/r/792215 (https://phabricator.wikimedia.org/T307714)
[14:09:41] Here is the deployment-charts CR. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/805826
[14:12:00] btullis: so when do we want to rebuild indices? Like why are we doing it now?
[14:16:04] oh, unrelated, it looks like we're running out of disk space with the Gitlab CI: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/21076
[14:45:15] milimetric: Could you add a note to this ticket about the GitLab CI issue? https://phabricator.wikimedia.org/T310593 I think it's a recurring problem.
[14:45:33] k
[14:46:23] ottomata: o/ if you have time during the next days can you let me know if https://github.com/wikimedia/ores/pull/361 makes sense? Just to avoid pebcaks :)
[14:46:49] Oh, maybe we won't need to rebuild the indices because our glossary doesn't currently exist: https://github.com/datahub-project/datahub/releases/tag/v0.8.36 I missed that part of the release notes.
[14:48:23] !log deploying datahub 0.8.38
[14:48:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:38] oh yeah, cool "If this is your first time using DataHub Glossaries, you're all set!"
[14:50:43] Ah, looks like it didn't work on staging. I'll have to look into it again.
[15:06:37] elukey: responded with a NIT, but LGTM!
[15:10:24] <3
[16:28:38] btullis: milimetric puppet failing on an-launcher1002 I think because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/802598
[16:28:48] Could not find resource 'File[/usr/local/bin/refinery-sqoop-mediawiki]' in parameter 'require' (file: /etc/puppet/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp, line: 81) on node an-launcher1002.eqiad.wmnet
[16:29:45] Doh! Thanks. Will look into it now.
[17:14:09] Hi! it seems that we're seeing some increased logspam from AQS due to some cassandra cluster being down
[17:14:25] https://grafana.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&refresh=1m&from=now-2d&to=now
[17:15:18] https://logstash.wikimedia.org/goto/39388e6839807924b6e43cda87487fda
[17:15:39] cwhite: thanks!
[17:16:24] Hi lmata - thanks for this. Pinging urandom: as well, since he has been helping to bootstrap this new cluster and start migrating data.
[17:17:11] btullis: ty!
[17:20:31] Yeah, it looks like all of the logs are coming from aqs2* hosts, which are not yet properly in service. They're downtimed in Icinga, but I hadn't thought about how to disable their log shipping.
[17:22:04] ottomata: joal: That puppet issue on an-launcher1002 is fixed now, I believe.
[17:25:15] lmata, cwhite - looks like it has subsided now. Would you agree?
[17:26:16] btullis: ty
[17:58:15] btullis: ty!
[18:07:28] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.113 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[20:13:17] thanks for cleaning up my mistake Ben!
[20:27:01] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.069 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[20:50:10] Data-Engineering, Foundational Technology Requests, Product-Analytics: "Source of truth" dataset for pageviews - https://phabricator.wikimedia.org/T310732 (DAbad) a:EChetty
[21:01:30] milimetric: a pleasure. I should have spotted it before +2ing it, but there we go.
[21:02:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:32] I'll make a ticket for this aqs1008.mgmt interface flapping. It's noisy on this channel since the Icinga change I made recently.
[21:35:06] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson)
[22:46:12] RECOVERY - Check systemd state on an-tool1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:18] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.998 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos