[00:20:57] <wikibugs>	 Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[00:41:40] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:45:39] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:52:08] <wikibugs>	 (PS1) DLynch: EditAttemptStep: add new values for init_mechanism [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805728 (https://phabricator.wikimedia.org/T298634)
[08:37:06] <elukey>	 hello folks
[08:37:48] <elukey>	 an-tool1009 seems to be having an issue with apache2, yesterday at around 10:33 UTC the CAS settings got removed
[08:38:13] <elukey>	 I don't recall exactly if Hue was working with CAS or not though
[08:38:44] <elukey>	 the change applied seems to be https://gerrit.wikimedia.org/r/c/operations/puppet/+/805191
[08:38:54] <elukey>	 but I can't find a correlation
[08:39:02] <elukey>	 and we have profile::hue::enable_cas: false
[08:48:37] <btullis>	 Thanks elukey. I will take a look at it.
[09:24:58] <btullis>	 The `auth_cas` module had been automatically enabled by `/var/lib/dpkg/info/libapache2-mod-auth-cas.postinst` after the new CAS-SSO packages were installed recently.
[09:26:58] <btullis>	 Puppet runs cleanly now and Apache has restarted, but I can't log in to Hue now.
[09:52:41] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:02:16] <wikibugs>	 (CR) Kosta Harlan: [C: +2] Add other_reason action_data to image_suggestion_interaction and link_suggestion_interaction schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805418 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[10:02:53] <wikibugs>	 (Merged) jenkins-bot: Add other_reason action_data to image_suggestion_interaction and link_suggestion_interaction schemas [schemas/event/secondary] - https://gerrit.wikimedia.org/r/805418 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[10:29:40] <elukey>	 btullis: I am a little puzzled since I saw in the puppet logs that CAS-related settings got removed, IIRC we enabled a plugin in Hue to pick the username from a CAS environment variable set by mod_cas
[10:30:10] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/628741
[10:30:37] <btullis>	 Yeah, I'm working with moritz.m on this right now. It's like `enable_cas` just got switched overnight and the templates have been swapped out here: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/hue.pp#L138-L155
[10:31:21] <elukey>	 I also recall https://gerrit.wikimedia.org/r/c/operations/puppet/+/678860/2/hieradata/hosts/an-tool1009.yaml
[10:31:29] <elukey>	 there were some extra params
[10:31:42] <elukey>	 but probably they ended up in the main role config
[10:32:23] <elukey>	 ah yes yes
[10:50:34] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (BTullis)
[10:51:01] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (BTullis) p:Triage→Medium
[11:17:32] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, Patch-For-Review: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (MoritzMuehlenhoff) Ben and myself did some debugging: While we had been using CAS for Hue for the last two years, it was ne...
[11:18:58] <btullis>	 elukey: This is fixed now. We changed `profile::hue::enable_cas:` to `true`
[12:05:32] <wikibugs>	 (CR) Joal: [WIP] Add projectview hql scripts to analytics/refinery/hql path. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[12:07:29] <wikibugs>	 (CR) Joal: [V: +2 C: +2] "LGTM! Merging and adding to deployment list" [analytics/refinery] - https://gerrit.wikimedia.org/r/805446 (https://phabricator.wikimedia.org/T309987) (owner: Milimetric)
[13:09:53] <elukey>	 btullis: weird :(
[13:28:55] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Beta-Cluster-Infrastructure, Event-Platform: Upgrade event platform related VMs in deployment-prep to Debian bullsye (or buster) - https://phabricator.wikimedia.org/T304433 (Ottomata) Open→Resolved
[13:37:16] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Event-Platform, EventStreams, Patch-For-Review: Expose mediawiki/revision/tags-change in stream.wikimedia.org - https://phabricator.wikimedia.org/T294391 (Ottomata) a:Ottomata
[13:48:22] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Event-Platform, EventStreams, Patch-For-Review: Expose mediawiki/revision/tags-change in stream.wikimedia.org - https://phabricator.wikimedia.org/T294391 (Ottomata) Getting a strange error when trying to deploy:   ` command "/usr/bin/helm3" exited wi...
[13:49:06] <wikibugs>	 Data-Engineering, Event-Platform, Generated Data Platform: Add Event Platform timestamp JSONSchema -> Flink type support - https://phabricator.wikimedia.org/T310495 (Ottomata)
[13:51:49] <milimetric>	 btullis: are we doing the DataHub upgrade?
[13:52:08] <btullis>	 Yeah, I was just about to ping you as well. :-)
[13:52:24] <btullis>	 Just researching how the index rebuild might be done.
[13:52:27] * milimetric checks telepathy booster
[13:52:44] <milimetric>	 k, I can join in cave if you want a partner
[13:57:02] <btullis>	 Mmm. They build a special jar for it, which we don't do at the moment: https://github.com/datahub-project/datahub/blob/master/docker/datahub-upgrade/Dockerfile#L16
[13:57:02] <btullis>	 I wonder if I can just build it on a stat box and run the command from there?
[13:57:52] <btullis>	 OK, give me a few minutes and then let's coordinate in the cave? How did the scheduled run go last night, I've not looked?
[13:58:50] <milimetric>	 failed for a silly path mistake (had leftovers from when the job was in analytics_test)
[13:58:57] <milimetric>	 so I was going to try it again after the upgrade
[13:59:12] <btullis>	 OK, cool.
[14:09:12] <wikibugs>	 (PS4) Milimetric: Add datahub metadata ingestion CLI as a conda env [analytics/refinery] - https://gerrit.wikimedia.org/r/792215 (https://phabricator.wikimedia.org/T307714)
[14:09:41] <btullis>	 Here is the deployment-charts CR. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/805826 
[14:12:00] <milimetric>	 btullis: so when do we want to rebuild indices?  Like why are we doing it now?
[14:16:04] <milimetric>	 oh, unrelated, it looks like we're running out of disk space with the Gitlab CI: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/21076
[14:45:15] <btullis>	 milimetric: Could you add a note to this ticket about the GitLab CI issue? https://phabricator.wikimedia.org/T310593 I think it's a recurring problem.
[14:45:33] <milimetric>	 k
[14:46:23] <elukey>	 ottomata: o/ if you have time during the next days can you let me know if https://github.com/wikimedia/ores/pull/361 makes sense? Just to avoid pebcaks :)
[14:46:49] <btullis>	 Oh, maybe we won't need to rebuild the indices because our glossary doesn't currently exist: https://github.com/datahub-project/datahub/releases/tag/v0.8.36 I missed that part of the release notes.
[14:48:23] <btullis>	 !log deploying datahub 0.8.38
[14:48:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:38] <milimetric>	 oh yeah, cool "If this is your first time using DataHub Glossaries, you're all set!"
[14:50:43] <btullis>	 Ah, looks like it didn't work on staging. I'll have to look into it again.
[15:06:37] <ottomata>	 elukey:  responded with a NIT, but LGTM!
[15:10:24] <elukey>	 <3
[16:28:38] <ottomata>	 btullis:  milimetric puppet failing on an-launcher1002 I think because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/802598
[16:28:48] <ottomata>	 Could not find resource 'File[/usr/local/bin/refinery-sqoop-mediawiki]' in parameter 'require' (file: /etc/puppet/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp, line: 81) on node an-launcher1002.eqiad.wmnet
[16:29:45] <btullis>	 Doh! Thanks. Will look into it now.
[17:14:09] <lmata>	 Hi! it seems that we're seeing some increased logspam from AQS due to some cassandra cluster being down
[17:14:25] <lmata>	 https://grafana.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&refresh=1m&from=now-2d&to=now
[17:15:18] <cwhite>	 https://logstash.wikimedia.org/goto/39388e6839807924b6e43cda87487fda
[17:15:39] <lmata>	 cwhite: thanks!
[17:16:24] <btullis>	 Hi lmata - thanks for this. Pinging urandom: as well, since he has been helping to bootstrap this new cluster and start migrating data.
[17:17:11] <lmata>	 btullis: ty!
[17:20:31] <btullis>	 Yeah, it looks like all of the logs are coming from aqs2* hosts, which are not yet properly in service. They're downtimed in Icinga, but I hadn't thought about how to disable their log shipping. 
[17:22:04] <btullis>	 ottomata: joal: That puppet issue on an-launcher1002 is fixed now, I believe.
[17:25:15] <btullis>	 lmata, cwhite - looks like it has subsided now. Would you agree?
[17:26:16] <ottomata>	 btullis:  ty
[17:58:15] <lmata>	 btullis: ty!
[18:07:28] <icinga-wm>	 PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.113 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[20:13:17] <milimetric>	 thanks for cleaning up my mistake Ben!
[20:27:01] <icinga-wm>	 PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.069 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[20:50:10] <wikibugs>	 Data-Engineering, Foundational Technology Requests, Product-Analytics: "Source of truth" dataset for pageviews - https://phabricator.wikimedia.org/T310732 (DAbad) a:EChetty
[21:01:30] <btullis>	 milimetric: a pleasure. I should have spotted it before +2ing it, but there we go.
[21:02:11] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:32] <btullis>	 I'll make a ticket for this aqs1008.mgmt interface flapping. It's noisy on this channel since the Icinga change I made recently.
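To put a number on the flapping mentioned above, here is a minimal Python sketch that counts PROBLEM/RECOVERY state changes per check from icinga-wm lines in a channel log like this one. The function name `count_transitions`, the regular expression, and the abbreviated sample lines are assumptions made for illustration; this is not an existing Wikimedia tool, and the line format is inferred only from the alerts seen in this log.

```python
import re
from collections import Counter

# Assumed line shape, inferred from this log:
# "[HH:MM:SS] <icinga-wm>\t PROBLEM - <check> is <STATE>: <details>"
ALERT_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\] <icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) "
    r"is (?:OK|WARNING|CRITICAL|UNKNOWN):"
)


def count_transitions(lines):
    """Count how often each check switches between PROBLEM and RECOVERY."""
    last_kind = {}      # check name -> last seen kind
    changes = Counter()  # check name -> number of state changes
    for line in lines:
        match = ALERT_RE.match(line)
        if match is None:
            continue  # not an icinga-wm alert line
        check, kind = match.group("check"), match.group("kind")
        if check in last_kind and last_kind[check] != kind:
            changes[check] += 1
        last_kind[check] = kind
    return changes


# Sample lines abbreviated from this log (details after the state are cut).
sample = [
    "[00:41:40] <icinga-wm>\t RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK",
    "[03:45:39] <icinga-wm>\t PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL",
    "[08:37:06] <elukey>\t hello folks",
    "[09:52:41] <icinga-wm>\t RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK",
    "[21:02:11] <icinga-wm>\t PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL",
]

for check, n in count_transitions(sample).most_common():
    print(f"{check}: {n} state changes")
```

A check that alternates many times within a day, as the aqs1008.mgmt SSH check does here, is a candidate for a ticket rather than repeated channel notifications.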
[21:35:06] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson)
[22:46:12] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:06:18] <icinga-wm>	 RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.998 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos