[04:13:53] Data-Engineering, Pageviews-API, Pageviews-Anomaly: Pageviews data dumps are not being created - https://phabricator.wikimedia.org/T326559 (Aquameta) Open→Resolved a:Aquameta They're back! Thanks. <3 <3 <3
[08:23:02] (Abandoned) Joal: Add gor.wiktionary to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/874831 (https://phabricator.wikimedia.org/T326139) (owner: Gerrit maintenance bot)
[08:24:15] (CR) Joal: [V: +2 C: +2] "Merging for next deplo" [analytics/refinery] - https://gerrit.wikimedia.org/r/875328 (https://phabricator.wikimedia.org/T326236) (owner: Gerrit maintenance bot)
[08:58:52] Data-Engineering, Event-Platform Value Stream (Sprint 07): Gitlab CI pipeline for Pytthon applications should bundle Java eventutilities and runtime deps - https://phabricator.wikimedia.org/T326567 (gmodena)
[08:59:33] Data-Engineering, Event-Platform Value Stream (Sprint 07): Gitlab CI pipeline for Pytthon applications should bundle Java eventutilities and runtime deps - https://phabricator.wikimedia.org/T326567 (gmodena) a:gmodena
[09:01:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:06:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:59:28] (CR) Aqu: [V: +2 C: +2] Create adhoc log4j.properties for quieter Spark logs [analytics/refinery] - https://gerrit.wikimedia.org/r/868754 (https://phabricator.wikimedia.org/T302500) (owner: Aqu)
[10:19:01] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet wi...
[10:24:19] !log roll-rebooting the druid-public cluster to pick up new kernel
[10:24:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:24:57] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmne...
[10:35:08] Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (JAllemandou)
[10:59:52] PROBLEM - Host an-worker1080 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:31] * btullis !log roll-rebooting the analytics druid cluster to pick up new kernel
[11:36:38] !log roll-rebooting the analytics druid cluster to pick up new kernel
[11:36:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:27] Analytics-Radar, Machine-Learning-Team: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (LSobanski)
[12:34:47] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet wi...
[12:36:10] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmne...
[12:39:05] Data-Engineering, Research-Backlog, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (LSobanski)
[12:59:34] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet wi...
[12:59:52] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1002.eqiad.wmne...
[13:24:18] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:32:49] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (jbond) @BTullis Seems you have already gone through most of the issues I went through. Some additional things to mention...
[13:35:56] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:36:57] hey folks. looks like you have new servers named 'cephosdXXXX'? I fear that name is going to be very confusing at some point since we already have ceph in use in WMCS.. so if at all possible please consider adding some clarification to the host name, or even just an-cephosdXXXX instead of plain cephosd
[13:47:06] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1002.eqiad.wmnet wi...
[13:56:30] Hi taavi. Thanks for bringing it up. I am aware that you've got cloudcephosd* machines in use in WMCS. I was originally going to refer to them as 'dse-cephosd*' servers, but I was advised to drop the prefix by
[13:58:18] ...Faidon, before he left. I'm not particularly comfortable with using the `an-` prefix, now that this team is no longer called Analytics and these servers aren't going to be dedicated to 'analytics' work.
[13:59:47] an- was just an example
[13:59:54] interesting. what will they be used for then instead?
[14:02:59] Well, it's intended to be more of an inclusive thing, so the `dse-` prefix of data science and engineering, which includes analytics, research, ml etc. I was advised to drop the prefix to make it more inclusive for future requirements.
[14:07:25] so is the plan to make it a generic ceph cluster for basically everyone?
[14:07:45] just to clarify, my concern is that plain 'cephosd' at least to me implies that it's the only ceph cluster around
[14:08:03] yeah, there are a bunch of production use cases that I'm hoping could use this cluster instead of burdening the DBs
[14:10:03] swift solved this quite nicely by using 'moss' (= misc object storage service) for the generic cluster, wonder if something like that could be coined up here
[14:10:24] or just ignore it for now, although this seems like something that is much easier to fix now than when it's in active use
[14:30:56] heya joal, I joined, if you're available, we can talk airflow druid loading, but I think today is your busy day no? lmk :]
[14:32:49] In meeting now, then kids - after standup?
[14:33:13] mforns: Heya - please excuse my rudeness - see the message just above :)
[14:33:55] what rudeness? xD ofc, let's talk after standup!
[14:35:28] mforns: not even saying hello :)
[14:36:25] no problemo :]
[14:38:08] Data-Engineering-Planning, API Platform (Sprint 02), AQS2.0, Platform Engineering Roadmap, User-Eevans: Obtain security review of uniqueDevices - https://phabricator.wikimedia.org/T320976 (Atieno) Our security review request has made it to this quarter's work. Priority has been set to medium
[14:43:12] Data-Engineering-Planning, API Platform (Sprint 02), AQS2.0, Platform Engineering Roadmap, User-Eevans: Obtain security review of uniqueDevices - https://phabricator.wikimedia.org/T320976 (Atieno) For Legal they are not sure if they should review so waiting on Security to ping them on if they...
[14:51:56] taavi: btullis can always just use a cluster name that doesn't have to have a squatted meaning. e.g. 'jumbo' :p
[14:52:48] https://www.merriam-webster.com/thesaurus/jumbo :p
[14:52:59] hefty-ceph
[14:53:06] :)
[14:54:19] I'd love if we could name it gargantua :)
[15:03:16] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Core-Hooks, Patch-For-Review: Create PageUndeleteComplete hook, analogous to PageDeleteComplete - https://phabricator.wikimedia.org/T321412 (Reedy)
[15:55:43] !log reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10.
[15:55:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:58:52] !log backfilling refine_event_sanitized_analytics_immediate on an-launcher1002 'sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_analytics_immediate --ignore_failure_flag=true --since=2023-01-07T17:00:00 --until=2023-01-08T10:00:00
[15:58:52] --table_include_regex="mediawiki_reading_depth|mediawiki_ipinfo_interaction|mediawiki_wikistories_consumption_event|mediawiki_content_translation_event|mediawiki_skin_diff|mediawiki_web_ab_test_enrollment" --verbose'
[15:58:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:10:13] Data-Engineering-Planning, Event-Platform Value Stream, Metrics-Platform-Planning, Patch-For-Review: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (EChetty)
[16:14:21] Data-Engineering-Planning, Data Pipelines (Sprint 05-06), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (Stevemunene) @Ottomata some changes were needed...
[16:36:03] Data-Engineering, Research-Backlog, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Danilo) I am interested in making tools with those data, but I am not familiar with the analytics infrastructure, I am mor...
[16:55:43] (VarnishKafkaDeliveryErrors) firing: (5) varnishkafka has cache_text errors on cp5017:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[16:55:47] (VarnishKafkaDeliveryErrors) firing: (7) varnishkafka has cache_upload errors on cp5025:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[16:55:51] (VarnishKafkaDeliveryErrors) firing: (5) varnishkafka has cache_text errors on cp5017:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[16:55:55] (VarnishKafkaDeliveryErrors) firing: (7) varnishkafka has cache_upload errors on cp5025:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:02:46] ^ with regard to these errors, eqsin has just been depooled due to a connectivity issue. Check in #wikimedia-operations for more info.
[17:03:36] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp5025:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:03:40] (VarnishKafkaDeliveryErrors) firing: (8) varnishkafka has cache_text errors on cp5017:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:03:45] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_upload errors on cp5025:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:10:41] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_text errors on cp5017:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:10:45] (VarnishKafkaDeliveryErrors) resolved: (8) varnishkafka has cache_text errors on cp5017:9132 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishKafkaDeliveryErrors
[17:11:20] Data-Engineering-Planning, DBA, Data-Services: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (Bennylin) Hi, can we get this done? It's been blocking T318506 for months now
[17:12:16] btullis: hi! can you help me please if you're there? I need someone with permissions :)
[17:12:32] I'm here. How can I help?
[17:13:59] Data-Engineering-Planning, DBA, Data-Services, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (BTullis) a:BTullis
[17:16:30] mforns: I'm here.
[17:16:38] hi! :]
[17:17:52] Data-Engineering-Planning, DBA, Data-Services, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (BTullis) Sincere apologies. This slipped between the cracks. I will prioritize it now.
[17:23:46] mforns: I'm out of that meeting now, do you want to batcave or huddle or is here fine?
[17:23:57] btullis: I'm trying something, one sec :]
[17:28:34] btullis, I was with joal in a meet, and he helped me solve that, it's good now!
[17:28:46] btullis: thanks anyway!!
[17:33:29] Great!
[17:33:57] !log chassis power reset on an-worker1132 (T326459)
[17:33:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:34:00] T326459: an-worker1132 down - https://phabricator.wikimedia.org/T326459
[17:35:13] Data-Engineering: an-worker1132 down - https://phabricator.wikimedia.org/T326459 (BTullis) I checked the console and there is no output, but `ipmitool` reports that the chassis is still powered. I issued a `chassis power reset` from ipmitool. ` btullis@cumin1001:~$ ipmitool -I lanplus -H "an-worker1132.mgmt...
[17:36:16] RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:36:18] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for aswikiquote - https://phabricator.wikimedia.org/T321294 (BTullis) a:BTullis
[17:36:39] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 (BTullis)
[17:37:32] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:05] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for shnwikibooks - https://phabricator.wikimedia.org/T321256 (BTullis) a:BTullis
[17:38:18] Data-Engineering, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for guwwikiquote - https://phabricator.wikimedia.org/T321288 (BTullis) a:BTullis
[17:39:38] Data-Engineering, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for gorwiktionary - https://phabricator.wikimedia.org/T326138 (BTullis) a:BTullis
[17:40:01] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for guwwiktionary - https://phabricator.wikimedia.org/T309056 (BTullis) a:BTullis
[17:40:58] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for pcmwiki - https://phabricator.wikimedia.org/T310879 (BTullis) a:BTullis
[17:41:18] Data-Engineering, DBA: Prepare and check storage layer for blkwiki - https://phabricator.wikimedia.org/T310872 (BTullis) a:BTullis
[17:52:03] Data-Engineering: an-worker1132 down - https://phabricator.wikimedia.org/T326459 (BTullis) Open→Resolved a:BTullis
[17:58:49] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (fnegri) We had similar issues with `cloudcephosd*` hosts, where the device name would change on reboot, and we sometimes...
[18:00:04] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Unique Devices service - https://phabricator.wikimedia.org/T288298 (JArguello-WMF)
[18:00:26] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Pageviews Service - https://phabricator.wikimedia.org/T288296 (JArguello-WMF)
[18:12:09] Data-Engineering-Planning, API Platform (Sprint 02), AQS2.0, Platform Engineering Roadmap, User-Eevans: Obtain security review of uniqueDevices - https://phabricator.wikimedia.org/T320976 (JArguello-WMF)
[18:24:14] Data-Engineering, API Platform (Sprint 03), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF)
[18:30:09] Data-Engineering, API Platform (Sprint 03), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) a:codebug→Emeka-okechukwu
[18:44:58] Data-Engineering-Planning, Event-Platform Value Stream, MediaWiki-Core-Hooks, Patch-For-Review: Create PageUndeleteComplete hook, analogous to PageDeleteComplete - https://phabricator.wikimedia.org/T321412 (Ottomata) Thanks. FYI, there are some ideas and preferences to refactor these core hooks t...
[19:46:05] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe)
[20:57:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:02:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:14:26] Data-Engineering, Event-Platform Value Stream (Sprint 07): Gitlab CI pipeline for Pytthon applications should bundle Java eventutilities and runtime deps - https://phabricator.wikimedia.org/T326567 (gmodena)