[00:03:37] 10Data-Engineering, 10SRE, 10SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10andrea.denisse) Hello @Gehel , do you approve @Dcausse access to the `analytics-admins` group ?
[00:20:07] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:17] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[00:41:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5029 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5029%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:07:24] 10Analytics-Jupyter, 10Data-Engineering-Planning, 10Product-Analytics, 10Data Pipelines (Sprint 04), 10Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (10xcollazo) - ~~Wait until wmfdata 2.0 is released (T300442). (Target is Wed Nov 23)~~ -...
[06:21:24] 10Data-Engineering, 10SRE, 10SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10Gehel) I approve!
[06:40:30] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10Marostegui) p:05High→03Unbreak! I think there are many things broken already that this deserves an UBN
[07:33:21] (03CR) 10Krinkle: [C: 03+2] painttiming: Add missing action and namespace. [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/860078 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog)
[07:33:26] (03PS2) 10Krinkle: painttiming: Add missing action and namespace [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/860078 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog)
[07:33:53] (03CR) 10Krinkle: [C: 03+2] painttiming: Add missing action and namespace [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/860078 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog)
[07:34:28] (03Merged) 10jenkins-bot: painttiming: Add missing action and namespace [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/860078 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog)
[07:40:41] (03CR) 10Gergő Tisza: [C: 03+1] Add user new impact data to the impact homepagemodule (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[07:40:54] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10elukey) Great work! I found this alert in icinga, under puppetmaster1001 -> Puppet CA expired certs: ` crit: kafka_broker_kafka-jumbo1001 kafka_...
[07:59:50] (03CR) 10Gergő Tisza: [C: 03+1] Add user new impact data to the impact homepagemodule (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[09:24:14] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) I see some immediate actions: * (a) Remove the ipv6 entry in netbox + refresh DNS * (b) Make mariadb listen on ip4 + ip6 (that probably...
[09:44:46] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10Marostegui) Thanks for the ideas: a) Sounds good b) That won't solve it, as it is not a problem on the IPs but on the grants, which is not so...
[09:52:45] 10Data-Engineering, 10SRE, 10SRE-Access-Requests: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10BTullis) a:03BTullis
[09:57:05] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) Yep, we noticed the same issue with `b` xd, thanks to a comment in the hiera setting, going for `a`. @Marostegui we should change all...
[09:57:31] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10Marostegui) Yes, all of them are having the same issue.
[10:26:03] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) Removed all the AAAA entries for clouddb* servers, maintain-dbusers now works as expected
[10:26:55] 10Data-Engineering, 10Data-Services, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) p:05Unbreak!→03High With a workaround in place we can move to high again to continue the investigation.
[10:27:09] 10Data-Engineering, 10Data-Services, 10User-dcaro, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro)
[10:27:12] 10Data-Engineering, 10Data-Services, 10User-dcaro, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) 05Open→03In progress
[10:27:22] 10Data-Engineering, 10Data-Services, 10User-dcaro, 10cloud-services-team (Kanban): clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) a:03dcaro
[10:27:25] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro)
[10:27:31] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro)
[10:29:36] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) Summary of the actions taken: * Removed the `DNS Name` from netbox for all the ip6...
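(The AAAA removal described above can be spot-checked from any client. A minimal sketch; the hostname below is illustrative — substitute an actual clouddb* FQDN, which will not resolve from outside the WMF network:)

```shell
# getent queries the resolver for IPv6 addresses; it exits non-zero and
# prints nothing when the name has no AAAA record (or does not resolve),
# which is the desired state after the netbox/DNS cleanup.
getent ahostsv6 clouddb1013.eqiad.wmnet || echo "no AAAA record"
```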
[10:30:50] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10dcaro) The task where the records were changed is T312557
[10:31:57] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10Marostegui) So from my side it is all fine. We should also double check with @taavi and @bd...
[10:33:59] 10Data-Engineering, 10Data-Services, 10Cloud-Services-Origin-User, 10Cloud-Services-Worktype-Unplanned, and 2 others: clouddb* hosts with ipv6 access timeout from cumin - https://phabricator.wikimedia.org/T323550 (10taavi) 05In progress→03Resolved
[10:55:41] (03PS3) 10Sergio Gimeno: Add user new impact data to the impact homepagemodule [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160)
[11:07:33] (03PS4) 10Sergio Gimeno: Add user new impact data to the impact homepagemodule [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160)
[11:10:25] (03CR) 10Sergio Gimeno: Add user new impact data to the impact homepagemodule (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[11:14:06] 10Data-Engineering, 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10BTullis) 05Open→03Resolved @dcausse, @gmodena - Welcome to the `analytics-admins` group! Please take suitable care with your...
[11:45:43] (03CR) 10Kosta Harlan: [C: 03+2] Add user new impact data to the impact homepagemodule [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[11:46:13] (03Merged) 10jenkins-bot: Add user new impact data to the impact homepagemodule [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[12:18:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:58] ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23Mng
[13:25:14] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) Thanks @elukey. I'm taking a look at this now. It's interesting because I found this: https://wikitech.wikimedia.org/wiki/Puppet#Renew_...
[13:42:43] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10elukey) @BTullis I think that most of the above are certs that we don't use anymore, like: ` elukey@puppetmaster1001:~$ sudo ls /var/lib/puppet...
[13:48:10] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) Great, I was coming to that conclusion too. The fact that it only lists six kafka jumbo brokers by name (`kafka_broker_kafka_jumbo100[1...
[13:53:31] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) Removed unused and expiring kafka_jumbo certificates. ` btullis@puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed$ sudo puppet cert...
[13:55:39] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) ` btullis@puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed$ sudo puppet cert clean kafka_client_test1 Warning: `puppet cert` is de...
[14:02:24] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) I have confirmed the varnishkafka client certificate expiry date on a cp host. ` btullis@cp1075:/etc/varnishkafka/ssl$ cat varnishkafka...
[14:12:47] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10elukey) >>! In T323697#8419896, @BTullis wrote: > Also you've listed `kafka_jumbo-eqiad_broker.pem` - Just double-checking, that one we **do**...
[14:17:45] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) >> And yes, the `varnishkafka` one jumped out at me too. > > We use it to authenticate varnishkafka to jumbo since only the `varnishk...
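(The expiry check quoted at 14:02:24 — inspecting the varnishkafka client PEM on cp1075 — can be reproduced with openssl. A sketch using a throwaway self-signed certificate, since the real file lives under /etc/varnishkafka/ssl on the cache hosts; the /tmp paths and CN are illustrative only:)

```shell
# Generate a short-lived self-signed cert purely for demonstration.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=varnishkafka-demo" \
  -keyout /tmp/vk.key -out /tmp/vk.pem -days 30 2>/dev/null
# Print only the notAfter (expiry) field, as one would for the real
# client certificate before deciding whether it needs renewal.
openssl x509 -enddate -noout -in /tmp/vk.pem
```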
[14:20:52] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10elukey) +1, the varnishkafka cert is another good candidate for PKI in my opinion, but very out of scope I know :)
[14:33:30] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (10BTullis)
[14:36:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5020 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5020%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:37:18] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) I have created a new ticket for the varnishkafka certificate renewal here: {T323771} It might be a good one for @Stevemunene to work on...
[14:41:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5020 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5020%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:44:33] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 (10BTullis) p:05Triage→03High
[15:30:25] !log Started deployment of refinery as part of weekly deployment train
[15:30:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:11:57] !log killed webrequest-druid-hourly-coord for restart as part of weekly deployment train.
[16:11:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:13:10] !log successfully restarted webrequest-druid-hourly-coord for restart as part of weekly deployment train.
[16:13:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:15:43] !log killed webrequest-druid-daily-coord for restart as part of weekly deployment train.
[16:15:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:21:33] !log restarted webrequest-druid-daily-coord as part of weekly deployment train.
[16:21:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:38:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp5019 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5019%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:43:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5019 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5019%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:52:15] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (10Volans) Thanks for opening this, I'd like too to see superset upgraded. We have a couple of SRE dashboards that when setting 3~4 filters hit the max URI lengh...
[17:15:04] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (10BTullis) p:05Triage→03High
[17:15:40] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Update kafka-jumbo certificates - https://phabricator.wikimedia.org/T323697 (10BTullis) 05Open→03Resolved
[17:27:20] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (10BTullis)
[17:27:51] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (10BTullis)
[18:20:51] (03CR) 10Gergő Tisza: Add user new impact data to the impact homepagemodule (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/859113 (https://phabricator.wikimedia.org/T323160) (owner: 10Sergio Gimeno)
[19:43:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:44:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:48:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:49:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:47:59] (03PS5) 10Aqu: Put wikihadoop into refinery/source [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/856530 (https://phabricator.wikimedia.org/T321168)
[21:48:01] (03PS13) 10Aqu: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168)
[22:45:07] (03CR) 10Aqu: "I still have little code optimization to do (cleaning up checkpoint dir, calculating parallelization through provided variables, ...). But" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: 10Aqu)