[01:11:59] (03CR) 10Clare Ming: [C: 03+1] Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [06:39:51] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [06:45:28] 10Data-Platform-SRE: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [06:45:31] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) 05Open→03Resolved [06:49:02] * brouberol waves good morning [06:59:56] I have a tiny MR that should make it a bit easier for ops to rebalance kafka partitions based on their size: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965787 [07:05:30] as well as a MR removing kafka-jumbo100[1-6].eqiad.wmnet from the broker list of apps running in k8s https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965164 [07:05:50] good morning brouberol [07:05:58] good morning! [07:07:25] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10Addshore) > The energy savings is possibly unclear, at least under current case (but that's partly because it's hard to know how much energy is... 
[07:09:21] (03CR) 10Aqu: [V: 03+2 C: 03+2] Update scap deployment in Hadoop test [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/961390 (owner: 10Aqu) [07:14:11] Hello, I'm trying to the refinery deployment process on the Hadoop test cluster https://phabricator.wikimedia.org/T347491 [07:55:26] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [08:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:17] aqu: Great. Let us know if it's fixed after the manual installation of git-fat [08:05:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:35] btullis: If I see changes that aren't mine in `puppet-merge`, the etiquette is to flag it and wait, isn't it? [08:16:11] Yes, flag it in #wikimedia-sre generally. 
[08:16:27] in my cases, it's new secrets being added to modules/secret/secrets/pki/ [08:17:14] done, thanks [08:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:53] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [08:31:55] I'm going to deploy the broker list change to our apps in k8s: datahub, eventgate-analytics, eventgate-analytics-external, eventstreams-internal, mw-page-content-change-enrich [08:32:14] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Scap deployment on Hadoop test cluster broken - https://phabricator.wikimedia.org/T347491 (10Antoine_Quhen) 05Open→03Resolved Thanks @BTullis ! I've tested it. It's now resolved! [08:32:17] brouberol: ack. [08:32:32] cc joal, wrt mw-page-content-change-enrich [08:34:25] !log redeploying eventstreams-internal with the new kafka broker list T336044 [08:34:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:34:28] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:38:19] !log redeploying eventgate-analytics with the new kafka broker list T336044 [08:38:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:39:13] joal: Are you happy for me to go ahead and push out the new version of presto to production for T342343 ? 
[08:39:14] T342343: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 [08:42:03] !log redeploying eventgate-analytics-external with the new kafka broker list T336044 [08:42:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:42:06] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:42:15] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Install jupyterhub separately from conda-analytics - https://phabricator.wikimedia.org/T321512 (10Antoine_Quhen) +1 for the simplicity induced in the `conda-analytics` package. [08:46:15] (EventgateValidationErrors) firing: ... [08:46:16] eventgate-analytics-external stream eventlogging_MobileWebUIActionsTracking validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationError [08:46:53] btullis: would you mind checking the output of the diff for datahub deploy? I'm seeing the env var DATAHUB_REVISION going from 3 to 1 on datahub-main-system-update-job. Thanks! [08:48:06] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena) [08:48:10] brouberol: That's fine, you can proceed with that. It's something to do with running migrations. We can work it out precisely when we move datahub to dse-k8s. For now, you can just apply the change. 
[08:48:47] :+1 noted [08:49:12] !log redeploying datahub with the new kafka broker list T336044 [08:49:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:49:15] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:51:15] (EventgateValidationErrors) firing: ... [08:51:16] (4) eventgate-analytics-external stream eventlogging_DesktopWebUIActionsTracking validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidation [08:55:38] brouberol: o/ did you see --^ [08:55:43] ^ these are event schema validation errors. Does anyone know anything about that? They line up with the deployment, but I don't know if they were _caused_ by the deployment or if that was some kind of silent change [08:55:46] it matches with your deployments [08:55:53] okok :) [08:55:57] what was the diff in helmfile? [08:56:10] the change only removed kafka-jumbo100[1-6] from the broker list [08:56:14] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen) @Ahoelzl , as a complement to T297231 , we could get from those Airflow metrics: * monitoring/alerting of the global number of failur... [08:57:58] I'm looking at the app logs [08:57:59] brouberol: okok, if you want to see what's wrong quickly, you can jump on a jumbo node and type [08:58:03] kafkacat -b localhost:9092 -t eqiad.eventgate-analytics-external.error.validation | jq '.' 
[08:58:22] "message": "'.event.font' should be string", [08:58:45] Yep, I was seeing the same thing at https://logstash.wikimedia.org/app/discover#/view/AXMlVWkuMQ_08tQas2Xi?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=h@c2dd4ef [08:58:59] so what I am wondering is if the restart of the pods triggered a refresh of the schemas [08:59:11] so I'd say that is a silent change triggered by a pod restart [09:01:43] is there anything I can help with elukey? [09:02:55] could be https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/960141 [09:03:41] okok so I think that they merged before the mediawiki train went out [09:03:47] and the restart triggered [09:05:01] brouberol: I'll follow up in the task, in theory we could revert the change and restart the pods [09:05:04] but not sure [09:05:05] should we try to roll ^ this back? [09:05:29] gotcha. Let me know how I can help, should you need it [09:09:02] brouberol: commented in https://phabricator.wikimedia.org/T346106#9252934 [09:09:55] Thanks brouberol & elukey for taking care of EventgateValidationErrors. I'm marking the ack in the mailing list to keep track in our OpsWeek. 
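For a quick tally of what the `kafkacat ... | jq '.'` command above returns, the validation-error records can be grouped by their `message` text. This is a minimal sketch, assuming each record arrives as one JSON object per line with a top-level `message` field. Only the field name appears in the log above; its position in the record and the helper name `tally_messages` are assumptions.

```python
import json
from collections import Counter


def tally_messages(lines):
    """Count validation errors by their `message` text.

    Assumes one JSON object per line with a top-level `message` key
    (an assumption, not confirmed by the log). Lines that are blank,
    not valid JSON, or lack a string `message` are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        if isinstance(message, str):
            counts[message] += 1
    return counts


# Illustrative input only; real records would come from the Kafka topic.
sample_lines = [
    '{"message": "\'.event.font\' should be string"}',
    '{"message": "\'.event.font\' should be string"}',
]
for message, count in tally_messages(sample_lines).most_common():
    print(f"{count}\t{message}")
```

To feed it real data, the records would need to stay on one line each, for example by using `jq -c '.'` rather than the pretty-printing `jq '.'`.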
[09:10:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:26] o/ [09:15:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:30] phuedx: so Olga answered and told me that the engineering team is not yet online [09:19:22] 10Data-Engineering, 10Data Products, 10Structured-Data-Backlog: Bump memory to enable large artifacts sync on HDFS - https://phabricator.wikimedia.org/T348958 (10mfossati) [09:20:34] elukey: Revert of the instrument: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/965900 [09:20:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:38] Is anyone around to deploy? I haven't done one in a while [09:21:50] I can do that [09:22:14] (once the revert is merged) [09:22:20] brouberol: that is a mediawiki deployment :) [09:22:40] phuedx: wait a second, I am not getting the full picture [09:22:47] oh, in that case, I'll refrain [09:23:22] phuedx: is the mw change for the next schema out, and issuing the error? [09:23:32] or are we waiting for a change in the train to fix this? 
[09:23:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:36] because it seems as if the current mediawiki code is missing the "font" field [09:24:39] elukey: Sorry. Yes. The instrument that is producing the invalid events is out in production. It was merged 12 days ago, which means it should have ridden the train last week [09:25:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:13] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10Ladsgroup) I think we are done with s2. Is that correct @ABran-WMF ? [09:26:14] The instrument has a bug in it – it uses a function (which I'm not familiar with) to check if a font setting is enabled or disabled. That function returns a string or a boolean (false). My guess is that if it returns false, then the value shouldn't be included in the event but, currently, it is [09:26:28] ahhh okok yes [09:26:34] I see the version schema change in the revert [09:26:47] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10ABran-WMF) We still need to update source servers [09:27:28] I'm familiar with the instrument but not the new additions so I'd prefer to revert rather than hotfix [09:27:46] There's another bug too but that appears to be coming from a different codebase. I'll create a follow-up task about that [09:27:49] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10Ladsgroup) >>! 
In T343109#9252998, @ABran-WMF wrote: > We still need to update source servers I think we did that. Maybe run a --check on --dc-masters? [09:27:50] phuedx: yes definitely, ideally if anybody was online they could hot fix it [09:28:09] (There are ~3,000 event validation errors about the is_page_previews_enabled property being set to null) [09:28:44] +1ed, I can in theory help deploying, it should be a scap backport blabla change [09:28:46] Anyway, sorry for not explaining my thinking. I got a little carried away, which I'll blame on being overly caffeinated :D [09:28:47] but I need to verify [09:28:53] nono thanks! [09:28:57] I wanted to double check :) [09:29:12] following up with releng/sre [09:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:52] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10ABran-WMF) >>! In T343109#9252999, @Ladsgroup wrote: > > I think we did that. Maybe run a --check on --dc-masters? you're right! my bad [09:33:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. 
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [09:41:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:14] 10Data-Engineering, 10Data Products, 10Structured-Data-Backlog: DagProperties don't automatically update Airflow variables - https://phabricator.wikimedia.org/T348963 (10mfossati) [09:46:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:07] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) I can confirm Airflow variables are not updated after deployment. Opened {T348963}, CC @xcollazo . 
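The instrument bug discussed earlier in the channel (a settings helper that returns either a string or `false`, with the `false` ending up in the event) and the resulting `'.event.font' should be string` error can be illustrated with a small sketch. It is written in Python for brevity rather than in the instrument's own language. `build_event` and `font_errors` are hypothetical names, the second function is only a stand-in for schema validation and not EventGate's actual validator, and treating `font` as an optional field is an assumption based on the guess voiced in the chat.

```python
def build_event(font_setting):
    """Build an event payload, including `font` only when it is a string.

    `font_setting` mimics a helper that returns a string or False.
    Omitting the key for False follows the guess made in the chat;
    it is not a confirmed fix.
    """
    event = {}
    if isinstance(font_setting, str):
        event["font"] = font_setting
    return event


def font_errors(event):
    """Stand-in check: `font` may be absent, but must be a string if present."""
    if "font" in event and not isinstance(event["font"], str):
        return ["'.event.font' should be string"]
    return []


# Passing the raw helper result through unchanged reproduces the error...
print(font_errors({"font": False}))
# ...while dropping the key when the helper returns False avoids it.
print(font_errors(build_event(False)))
```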
[09:49:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:42] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10gmodena) [09:55:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [09:57:15] elukey: The validation error rate is declining steadily [09:58:52] super [10:00:51] I'm going ahead with the deployment of presto version 0.283 to production - unless anyone objects in the next few minutes... 
:-) [10:01:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:04] !log deploying presto version 0.283 to production for T342343 with `sudo debdeploy deploy -u 2023-10-12-presto.yaml -Q 'P{O:analytics_cluster::presto::server} or P{O:analytics_cluster::coordinator} or A:stat'` [10:06:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:06:07] T342343: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 [10:06:16] (EventgateValidationErrors) firing: ... [10:06:16] (4) eventgate-analytics-external stream eventlogging_DesktopWebUIActionsTracking validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidation [10:08:12] (SystemdUnitFailed) firing: (10) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:57] (SystemdUnitFailed) firing: (17) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:13] PROBLEM - 
Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:35] elukey: I'm not quite following why the errors only started appearing this morning. The patch would have been deployed on last week's train [10:12:46] You said you'd restarted EventGate? [10:13:12] (SystemdUnitFailed) firing: (17) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:28] ^looking at this presto failure [10:15:26] systemd seems happy with it on an-coord1001. Restarted 5 minutes ago, as per the deploy. [10:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:10] phuedx: not me but there was an unrelated deployment of eventgate-analytics-external, that I believe forced the new schema repository version to be picked up [10:16:16] (EventgateValidationErrors) firing: ... [10:16:16] (4) eventgate-analytics-external stream eventlogging_DesktopWebUIActionsTracking validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidation [10:16:22] phuedx: I am puzzled as well [10:17:00] phuedx: the only explanation that I can give to me is that eventgate accepts events of schema that have a higher version [10:17:04] without validation etc.. 
[10:18:13] (SystemdUnitFailed) resolved: (17) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:12] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) This looks good. All presto servers and client deployed. ` btullis@stat1009:~$ presto --catalog analytics_hive Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8 presto> SELECT node_... [10:22:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:13] (SystemdUnitFailed) firing: (17) presto-server.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:32] phuedx: one weird thing is that I still see some events like ""message": "'.event.font' should be string"," [10:24:41] I guess static assets cached? 
[10:24:59] elukey: Yes [10:25:09] okok lovely [10:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:12] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:54] I'm redeploying datahub in eqiad. 
[11:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:01] joal: would it be ok to redeploy mw-page-content-change-enrich in the next hour? I have a kafka broker topology config change to deploy. [11:46:03] 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) Updating the rollout plan as discussed - Disable puppet on druid100[4-6] and druid10[09-11] and an-launcher1002 - Manually stop zookeeper on druid1004 (This being the host w... 
[11:52:27] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10dr0ptp4kt) Good question - I meant the contrast with respect to the .ttl.gz dumps and everything that goes into munging and importing (in aggreg... [12:02:37] This is probably documented somewhere and I'm probably just not asking Phab / Wikitech web search correctly - so apologies! - but anyone know if there's a quick way to obtain stats per-file of dumps.wikimedia.org files? I was just trying to query Hive for uri_host like '%dumps%' and uri_host = 'dumps.wikimedia.org' (seems to be possibly waiting in line), but realized maybe there's a faster way someone knows off the top of their head? [12:05:04] uri_host = 'en.wikipedia.org' limit 5 is returning plenty fast from webrequest, but for some reason the 'dumps' variants not so much, even on just an hour of data (I did bump up the memory with HADOOP_OPTS="-Xmx2048m") [12:14:37] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10fgiunchedi) >>! In T329398#9250237, @BTullis wrote: > @fgiunchedi - I wonder if you might be able to advise here, please? > > We have an x509 certificate on-disk, but it's not exposed via a TCP service. >... [12:18:47] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10brouberol) @fgiunchedi Nice, thanks! According to you, would it be ok to include this module in a non-puppetmaster-related role? [12:55:16] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) Thanks, @nshahquinn-wmf! I've made [wikimedia/wmfdata-python/pull/46](https://github.... 
[13:02:25] 10Data-Platform-SRE: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (10fgiunchedi) >>! In T329398#9253445, @brouberol wrote: > @fgiunchedi Nice, thanks! According to you, would it be ok to include this module in a non-puppetmaster-related role? In general yes, although from a... [13:02:44] Hi brouberol - it's possible :) [13:02:58] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10jcrespo) [13:03:26] joal: thanks, on it. [13:03:40] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10jcrespo) I've been bold and added an extra test about testing backup is working after maintenance, let me know if that is reasonable and adapt to your needs. [13:05:01] 10Data-Platform-SRE, 10Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (10Gehel) [13:05:03] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10Gehel) [13:05:51] !log deploying mw-page-content-change-enrich with the new kafka broker list T336044 [13:05:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:05:54] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [13:08:29] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10BTullis) >>! In T284150#9253623, @jcrespo wrote: > I've been bold and added an extra test about testing backup is working after maintenance, let me know if that is reasonable and adapt t... 
[13:20:56] (03PS5) 10Joal: Add unique-devices Iceberg schemas and scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/964939 (https://phabricator.wikimedia.org/T347689) [13:27:06] I'm looking at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage that mentions that a specific partman recipe must be used to retain data during a reimaging, but does not provide additional info. Is there more documentation somewhere about this point specifically? [13:28:38] (03CR) 10Gmodena: [C: 03+1] cirrussearch/update_pipeline/fetch_error use general error_type (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/963990 (owner: 10Peter Fischer) [13:28:54] brouberol: I can point you to some examples where the technique is used. Documentation may be a bit lacking though :-) [13:29:26] documentation by copy-pasting is fine by me :P [13:30:27] If you look in this file, you will see all of the server classes where the partition recipe starts with: ` echo reuse-parts.cfg` [13:30:28] https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/netboot.cfg#L95 [13:30:46] Or, in some cases it starts with `echo reuse-parts-test.cfg` [13:32:05] Here's an active ticket where a new reuse recipe is being defined for use: https://phabricator.wikimedia.org/T347738 [13:33:34] The only difference between the `reuse-parts.cfg` and `reuse-parts-test.cfg` is that the latter waits in the installer for us to check that the disk operations look right before formatting. [13:34:15] Often, if we verify that it works properly with the `-test` version, we can then replace it with the non-test version and allow non-interactive reimaging. [13:34:39] That's if we have lots of identical servers to reimage, we can test the first. [13:35:36] Thanks! That's more than enough to get me started! 
[13:49:50] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10brouberol) I was looking at how to ensure that `kafka-jumbo100[7-9]` would retain their data during a reimage, and found the following netboot config, thanks to @BTullis: ` case $(debconf-get ne... [13:50:03] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10BTullis) Thanks all. So the next question for anyone is, which of the existing replicas would be the best for me to use to recreate the s2 replica on dbstore1007? Would it... [13:53:31] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10brouberol) Given that `kafka-jumbo100[1-6]` are now empty of all data and are due for decommission, I'm thinking that we should have the following: ` case $(debconf-get netcfg/get_hostname) in \... [13:59:16] 10Data-Platform-SRE: Upgrade kafka-jumbo100[7-9] to Debian Bullseye - https://phabricator.wikimedia.org/T348495 (10brouberol) 05Open→03In progress [13:59:20] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [13:59:22] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) [14:03:47] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10jcrespo) > Would it be preferable to use one of the normal replicas, or a backup replica? cc @jcrespo - Do you have an opinion on what would be the best source for this reb... 
[14:17:52] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Re-clone dbstore1007:s2 following a crash - https://phabricator.wikimedia.org/T343109 (10jcrespo) In other words, what I would do if *I were you* is running at cumin1001, as root: ` transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/backups/snapshots/la... [14:30:14] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) ` brouberol@cumin1001:~$ sudo cumin 'kafka-jumbo100[1-6].eqiad.wmnet' "ls /srv/kafka/data" 6 hosts will be targeted: kafka-jumbo[1... [14:55:41] btullis: I found out that kafka-jumbo100[1-6] each still hold about 270 established TCP connections to the kafka ports, some coming from k8s pods, logstash/kafka-test/kafka-logging hosts, and many other IPs w/o reverse DNS. [14:56:35] so if we were to shut down these hosts right now, I think it'd cause a pretty widespread disruption [14:57:20] I wonder if there might be other repos with connect strings to the clusters, other than puppet, refinery and deployment-charts [14:59:03] also, why would kafka-test and kafka-logging IPs be connected to kafka-jumbo? Would that be mirrormaker? 
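Side note on the connection check above: a rough sketch, not the command that was actually run, of how established connections to a broker's Kafka ports could be tallied per client IP. It assumes text in the shape that `ss -tn state established` prints (no State column). The port set and the sample addresses are assumptions for illustration.

```python
from collections import Counter

# Assumed listener ports: 9092 is Kafka's default, 9093 is a common TLS
# choice. Adjust to whatever the brokers are really configured with.
KAFKA_PORTS = frozenset({"9092", "9093"})

# Made-up sample in the shape of `ss -tn state established` output.
SAMPLE = """\
Recv-Q Send-Q Local Address:Port Peer Address:Port
0      0      10.0.0.5:9092      10.0.1.7:51514
0      0      10.0.0.5:9093      10.0.1.7:40022
0      0      10.0.0.5:9093      10.0.2.9:33810
0      0      10.0.0.5:22        10.0.3.3:50000
"""


def peers_on_kafka_ports(ss_output, ports=KAFKA_PORTS):
    """Count established connections per peer IP whose local port is a Kafka port."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and blank lines: data rows start with numeric Recv-Q.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        local, peer = fields[2], fields[3]
        local_port = local.rsplit(":", 1)[-1]
        if local_port in ports:
            peer_ip = peer.rsplit(":", 1)[0]
            counts[peer_ip] += 1
    return counts


if __name__ == "__main__":
    for ip, n in peers_on_kafka_ports(SAMPLE).most_common():
        print(f"{ip}\t{n}")
```

The per-IP counts could then be fed to reverse DNS lookups to see which clients still point at the old brokers.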
[15:02:46] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) [15:02:56] 10Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/1 [15:03:15] 10Data-Engineering, 10Product-Analytics, 10Wikidata Analytics, 10Wmfdata-Python: wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) [15:03:51] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) [15:06:03] (03CR) 10Joal: "This patch should be applied on top of https://gerrit.wikimedia.org/r/c/analytics/refinery/+/964573 where the old hive computation script " [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) (owner: 10Aqu) [15:06:46] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) Would be happy to work on this! - Note that this task came from work in {T341589}. [15:07:53] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) [15:11:06] brouberol: that is odd, but good thinking to check. I can't think that we use mirrormaker for anything *from* kafka-jumbo to [15:11:40] ...anywhere. [15:13:43] You could check the gitlab global search and/or codesearch.wmcloud.org for more references. [15:14:58] Oh we have a thing called burrow, which monitors consumer lag, I believe. I've never had to deal with it much, but might be worth searching for it. 
[15:15:21] come to think of it: we haven't removed these 6 nodes from puppet yet. The pr is still marked as wip. Any service config listing all jumbo hosts by iterating over each of them would get updated, and would cause service restarts [15:16:03] so we might start there and reassess. wdyt? [15:16:22] the pr in question https://gerrit.wikimedia.org/r/c/operations/puppet/+/965162 [15:16:56] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10Gehel) [15:18:55] for ex: all varnishkafka hosts might still hold a ref to these hosts in their config [15:20:16] Oh, ok. So we can probably start to work on removing them, then? [15:21:10] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Icebox, 10Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (10phuedx) > Decide how to handle indirect calls of .logEvent() (i.e. calls to mw.track( 'event.Foo' )) There are no calls to `mw.... 
[15:21:26] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Icebox, 10Epic: [EPIC] Deprecate mw.eventLog.logEvent() - https://phabricator.wikimedia.org/T317874 (10phuedx) [15:23:28] that's my understanding as well, yes [15:27:45] any use of `kafka_config('jumbo-eqiad')` would still include these 6 brokers atm: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/role/lib/puppet/parser/functions/kafka_config.rb#87 [15:35:24] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (10Gehel) [15:35:26] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) [15:36:35] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10Gehel) Removing this as a dependency of deploying the Search Update Pipeline... [15:36:40] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) FWIW failures to fetch stream configs has disappeared since @phuedx p... 
[15:42:10] (03CR) 10Sbisson: [C: 03+1] T343183 add story_share event and bump to version 1.2.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965845 (owner: 10Conniecc1) [15:43:54] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10matmarex) [15:53:17] (03CR) 10Sbisson: [C: 03+1] T343183 add "stoty share" event; add "user_is_anonymous" field and bump to version 1.1.0 (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [16:05:26] (03PS1) 10Kimberly Sarabia: Revert "Refactor schema structure" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966226 [16:37:35] (03CR) 10Xcollazo: [C: 03+1] "Just confirming I looked these changes over. LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [16:44:11] 10Data-Platform-SRE: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10bking) [16:44:25] 10Data-Platform-SRE: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10bking) [16:44:30] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [17:08:27] 10Data-Engineering: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (10XenoRyet) @BTullis Sorry for not getting back to you on this sooner. Yes, dropping a tarball of this stuff in my home directory sounds like a good idea. 
[17:19:17] 10Data-Platform-SRE: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10bking) a:03bking [17:20:25] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) Apologies for not catching this earlier: - to be aligned with the dumps that are imported into hdfs we must select a particular set of file... [17:21:19] 10Data-Platform-SRE: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10bking) [17:26:27] (03CR) 10Jdlrobson: "Can't we just bump the existing version number? Isn't reverting in this repo risky?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966226 (owner: 10Kimberly Sarabia) [17:44:45] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:50:04] https://www.irccloud.com/pastebin/weEU0EVU/ [17:50:55] stevemunene: oops meant to dm that but here's what I mean about the scap.cfg not being present for the new wmde dag [17:51:45] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:59:35] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bb8fddcb-96c9-4078-bcf9-9fd9d4c95358) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service... 
[18:09:18] (03Abandoned) 10Kimberly Sarabia: Revert "Refactor schema structure" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966226 (owner: 10Kimberly Sarabia) [18:26:44] (03PS1) 10Kimberly Sarabia: Schema bump desktop and mobile web ui [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966278 (https://phabricator.wikimedia.org/T346106) [19:20:06] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:27:09] (03CR) 10Jdlrobson: [C: 03+2] Schema bump desktop and mobile web ui [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966278 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [19:27:41] (03Merged) 10jenkins-bot: Schema bump desktop and mobile web ui [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966278 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia) [19:29:04] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:31:02] (03PS4) 10Aqu: Use canonical_data.countries when populating the referer tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965771 (https://phabricator.wikimedia.org/T348504) [19:45:42] (03PS1) 10DLynch: Add a new init_mechanism to editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966289 (https://phabricator.wikimedia.org/T243641) [19:46:52] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:56:46] (EventgateValidationErrors) resolved: ... 
[19:56:46] eventgate-analytics-external stream eventlogging_MobileWebUIActionsTracking validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationError [20:05:20] 10Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (10rook) [20:06:25] 10Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (10rook) [20:06:38] 10Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (10rook) [20:06:50] 10Quarry, 10Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (10rook) [20:06:57] 10Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (10rook) [20:07:03] 10Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (10rook) [20:08:37] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) [20:08:54] 10Quarry, 10Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (10rook) [21:06:05] (03PS2) 10Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) [21:15:05] (03CR) 10CI reject: [V: 04-1] Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [21:38:49] (03PS2) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [22:01:19] 10Data-Platform-SRE, 10Patch-For-Review: Improve data-reload cookbook based on graph split needs - 
https://phabricator.wikimedia.org/T349011 (10bking) @dcausse We're working on the cookbook change, but this made me realize that the current reload cookbook always takes the latest lexemes dump, which isn't in sy... [22:02:15] (03PS1) 10Conniecc1: Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 [22:02:42] (03CR) 10CI reject: [V: 04-1] Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1) [22:05:03] (03PS2) 10Conniecc1: T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 [22:05:31] (03CR) 10CI reject: [V: 04-1] T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1) [22:27:06] 10Data-Engineering, 10Tool-Pageviews: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) [22:27:23] 10Data-Engineering, 10Tool-Pageviews: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) p:05Medium→03Unbreak! Raising to UBN as per duplicate task [22:28:11] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) [22:30:03] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) [22:31:42] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10MusikAnimal) a:03Sfaci Sorry for all the noise! I didn't realize until after merging the old task was assigned etc. 
[22:32:43] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed