[00:26:16] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) I'm still making good progress on this, but it's not quite there yet. I fixed the issue with the...
[00:27:45] (PS4) Btullis: Upgrade superset to verstion 1.5.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/865609 (https://phabricator.wikimedia.org/T323458)
[06:28:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4040 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4040%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:33:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4040 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4040%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:43:21] Data-Engineering-Planning, LDAP-Access-Requests, SRE, SRE-Access-Requests, WMF-Communications: Grant Access to staff LDAP group for Sbenchagra - https://phabricator.wikimedia.org/T324696 (RhinosF1) >>! In T324696#8462191, @Varnent wrote: > @jhathaway - Apologies - have added links to that tem...
[09:13:02] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (dcausse) >>! In T324576#8457473, @Ottomata wrote: > Q for @dcausse and @gmodena. > > I've thu...
[09:38:56] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) Don't you just love bumping into your colleagues on the Internet? In searching for a resolution to...
[09:44:15] (PS4) Aqu: Extract to a NameNode the creation of the raw FSImage [analytics/refinery] - https://gerrit.wikimedia.org/r/867185 (https://phabricator.wikimedia.org/T324850)
[10:37:32] (PS5) Aqu: Extract to a NameNode the creation of the raw FSImage [analytics/refinery] - https://gerrit.wikimedia.org/r/867185 (https://phabricator.wikimedia.org/T324850)
[11:24:57] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) OK, this problem was previously observed by Razzi whilst upgrading to version 1.4.2 It was solved...
[11:37:57] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) I believe that https://superset-next.wikimedia.org/ is now ready for testing.
[11:53:50] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) There is one repeated warning in the superset log that will be worth addressing, but doesn't seem...
[11:55:38] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) I'm moving this ticket to in-review whilst it undergoes user acceptance testing. I'll draft an ac...
[12:01:48] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (gmodena) >>! In T324576#8457473, @Ottomata wrote: > I think I'd prefer not to write log files...
[12:46:52] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (akosiaris) >>! In T324576#8463074, @gmodena wrote: >>>! In T324576#8457473, @Ottomata wrote: >...
[14:54:39] Data-Engineering, SRE-OnFire, serviceops, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Clement_Goubert)
[14:55:03] Data-Engineering, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert)
[14:59:01] Data-Engineering, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Ottomata) > reverting to the state before the incident Hm, do we need to revert? I don't mind eith...
[15:06:14] Data-Engineering-Planning, Data Pipelines, Product-Analytics (Kanban): Add mediawiki_web_ab_test_enrollment to the allowlist - https://phabricator.wikimedia.org/T323664 (MNeisler) Open→Resolved
[15:06:46] Data-Engineering, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) >>! In T324994#8463585, @Ottomata wrote: >> reverting to the state before the inci...
[15:51:41] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (Ottomata) > unless you count SQL Hints Are these so bad? Could we default to the latest version for both sources and...
[15:59:29] (CR) Snwachukwu: [WIP] Refactor and Expand External referer classification (10 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[16:02:57] Data-Engineering, Wikibase change dispatching scripts to jobs, serviceops-radar, Platform Team Workboards (Platform Engineering Reliability): Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (jijiki)
[16:04:13] (PS6) Snwachukwu: [WIP] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[16:09:21] (CR) CI reject: [V: -1] [WIP] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
[16:15:14] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis) I tried a reinstall of kafka-stretch2002 with slightly different RAID controller settings, but that didn't work eithe...
[16:21:07] Data-Engineering, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Ladsgroup) Hi, The flood of logs is still incoming, the revert of logspam has not been deployed yet...
[16:44:54] actually, btullis
[16:45:09] i think that is not the right task either. i'm pretty sure kafka-stretch200[12] are fine. https://phabricator.wikimedia.org/T314160
[16:45:14] I know, that's also the wrong ticket. :-)
[16:45:19] Moving it as we speak.
[16:45:22] right but i think you are reimaging the wrong hosts too?
[16:45:46] It's kafka-stretch2002 wasn't right though.
[16:46:17] oh? i thought its new kafka-jumbos in eqiad, and kafka-stretch100[12] that are wrong?
[16:46:22] according to phab tickets anyway
[16:47:02] I think that there are several issues affecting all new servers with H750 and the disk setup we have in kafka boxes.
[16:47:02] robh and papaul did kafka-stretch200[12] (and fixed the /dev issues?) in https://phabricator.wikimedia.org/T314160#8166665 in august
[16:47:11] Data-Engineering-Planning, Data Pipelines (Sprint 05-06), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Airflow configuration file in puppet to be compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (Antoine_Quhen) a:Snwachukwu→Antoine_Quhen
[16:47:37] 2001 was ok, 2002 was broken in that root was still installed on /dev/sdb
[16:47:41] ah hm okay
[16:47:59] okay so https://phabricator.wikimedia.org/T314160 is for codfw stretch.
[16:48:11] and afaik new eqiad jumbos and stretch are also not working
[16:49:11] anyway, your comment that you moved from the jumbo ticket actually belongs on the codfw stretch ticket, which is https://phabricator.wikimedia.org/T314160, not https://phabricator.wikimedia.org/T314156#8463875 (my bad for giving you the wrong ticket before)
[16:49:25] Yeah, but it's all the same problem, I believe. 2002 was installed, but it wasn't right. I should have put my comment against the codfw ticket, even though it was resolved.
[16:49:27] thanks for working on it!
[16:49:30] okay.
[16:49:41] I have too many phab tabs open :-)
[16:50:30] (CR) Milimetric: "nothing blocking, for me just some style thoughts you may or may not want to incorporate, but looks good." [analytics/refinery] - https://gerrit.wikimedia.org/r/858370 (owner: Nmaphophe)
[16:50:34] Classic Ben comment here: https://phabricator.wikimedia.org/T297913#8041258
[16:50:34] > I'll update https://wikitech.wikimedia.org/wiki/Raid_setup with this information.
[16:50:34] ...but I never did.
[17:02:53] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (BTullis) Hello, just FYI I reimaged kafka-stretch2002 because the `/dev/sda` and `/dev/sdb` were the wrong way around. {F35861...
[17:08:37] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis) I should have written the comment above on the kafka-stretch ticket for codfw(T314160) despite the fact that it was r...
[17:27:41] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmn...
[17:43:38] Data-Engineering, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Clement_Goubert) Yes, the commit message of the above changelog makes it very clear it is not to be...
[18:21:38] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet w...
[18:30:07] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @JMeybohm re needed ingress and egress. **Ingress**: I don't think we //need// anyt...
[18:55:37] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (BTullis) OK, I cleaned up the failed bit of DHCP automation that was causing the cookbook to fail on kafka-stretch2001. Now we...
[19:49:57] Data-Engineering-Planning, LDAP-Access-Requests, SRE, SRE-Access-Requests, WMF-Communications: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (Varnent)
[19:50:07] Data-Engineering-Planning, DC-Ops, Event-Platform Value Stream, SRE, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Cmjohnson) @btullis yes, if you want to recreate the raid manually then please do.
[20:13:29] Data-Engineering, Event-Platform Value Stream (Sprint 05): Flink wrappers and helper libraries should be moved into a dedicated git repo with packaging and CI. - https://phabricator.wikimedia.org/T324746 (Ottomata) Thought: it would just so nice if we could run the same enrichment logic for backfilling,...
[20:15:14] Data-Engineering-Planning, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), User-brennen, Wikimedia-production-error: EventBus: Error: Call to a member function isCurrent() on null - https://phabricator.wikimedia.org/T323294 (Ottomata) Open→Resolved a:Ottomata
[20:46:19] Data-Engineering-Planning, LDAP-Access-Requests, SRE, SRE-Access-Requests, and 2 others: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (jhathaway) added!
[21:02:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (tchin) > Are these so bad? Could we default to the latest version for both sources and sinks, but allow SQL hints to o...
[21:15:46] (CR) Aqu: [V: +2 C: +2] Extract to a NameNode the creation of the raw FSImage [analytics/refinery] - https://gerrit.wikimedia.org/r/867185 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[21:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:38] !log Deploying analytics/refinery (HDFS FSImage conversion to XML script)
[21:35:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:48:55] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Integrate Flink Table API in eventutils-python - https://phabricator.wikimedia.org/T324953 (Ottomata) Spent a little bit of time thinking about this today, and I'm not sure how it will work. You've been able to workaround some of the annoyi...
[22:06:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state