[01:15:13] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:28:57] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:45:13] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:58:59] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:29:41] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:15:17] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:29:05] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:45:05] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:56:13] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:09:23] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:15:56] !log Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-1-26
[08:15:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:17:01] bonjour
[08:18:28] Bonjour Luca :)
[09:19:05] (03PS3) 10Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[09:19:36] Hello all.
[09:23:58] Hi btullis
[09:26:42] (03CR) 10jerkins-bot: [V: 04-1] Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[09:30:24] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:41:51] (03PS4) 10Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[09:44:02] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:51:36] interesting read: https://eng.uber.com/cost-efficiency-big-data/
[09:55:06] 10Analytics-Clusters, 10Data-Engineering, 10Data-Engineering-Kanban, 10Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (10BTullis) After 2 days I think we can begin to see the pattern that we were expecting. Namely that aqs...
[10:01:02] btullis: o/ not sure if you saw https://phabricator.wikimedia.org/T300164, everything looks good now but we should quantify the data loss :( (it is not related to the last changes but a pre-existing issue sigh)
[10:01:24] there are some follow-ups to do, like fixing apt/puppet and adding monitors
[10:01:43] (not super urgent, we can do it next week)
[10:01:49] I'll add some comments
[10:03:24] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey) Besides figuring out the data loss, we should: 1) Fix puppet and apt, the right varnishkafka version should be deployed. Not su...
[10:19:01] elukey: Right. I saw the chat yesterday, but this is the first time that I've caught up with it. So we didn't get an alert because the systemd unit for varnishkafka said it was active and running, right? Even though behind the scenes it was restarting?
[10:23:14] btullis: exactly yes
[10:23:41] puppet installs an old version, and the traffic team handled new versions via a special component
[10:23:54] (something that I wasn't aware of)
[10:24:36] the instances were working but not sending any traffic, and we don't have monitors for it
[10:24:46] (we have monitors for delivery errors)
[10:27:36] Are those prometheus based? I don't see them in Icinga.
[10:28:56] Oh yeah, I see them. varnishkafka)deliver_alert.pp
[10:29:28] exactly yes
[10:29:57] they are aggregated, otherwise there was a lot of spam
[10:32:00] OK, so we just need to add another alarm to this, but we should really be doing it in the context of https://phabricator.wikimedia.org/T293399 whereby all check_prometheus alerts are migrated to alertmanager.
[10:37:16] Ah, right, a single check is associated with alert1001 instead of separate checks for each cp host. Got it.
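(Aside: the new check discussed above, T300246, amounts to alerting when a cp host's varnishkafka is producing near-zero messages, rather than only alerting on delivery errors. The real implementation would be a Prometheus/alertmanager rule; the Python sketch below only illustrates the per-host query logic. The Prometheus endpoint and the varnishkafka metric name are assumptions, not the actual names used in WMF's config.)

# Hedged sketch only: list cp hosts whose varnishkafka is delivering ~no messages.
# PROMETHEUS and the metric name below are placeholders, not the real WMF names.
import requests

PROMETHEUS = "http://prometheus.example.org"                   # assumed endpoint
QUERY = "rate(varnishkafka_delivered_messages_total[10m])"     # hypothetical metric

def hosts_with_no_traffic(threshold: float = 1.0) -> list[str]:
    """Return instances whose varnishkafka message rate is below `threshold` msgs/s."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    return [
        series["metric"].get("instance", "unknown")
        for series in resp.json()["data"]["result"]
        if float(series["value"][1]) < threshold
    ]

if __name__ == "__main__":
    for host in hosts_with_no_traffic():
        print(f"varnishkafka on {host} is delivering ~no messages")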
[10:58:30] 10Data-Engineering, 10Data-Engineering-Kanban: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (10BTullis)
[10:58:50] 10Data-Engineering, 10Data-Engineering-Kanban: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (10BTullis)
[10:59:45] 10Data-Engineering: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (10BTullis)
[11:02:37] Created a sub-task for the monitoring component. Might be a good one for raz.zi to look at, maybe. I've done most of the alertmanager stuff, so would be useful to share this.
[11:03:51] The puppet one is important too, given that it's likely to happen again on the next reimage.
[11:11:42] definitely, all the new hosts in Marseille were affected :D
[11:43:33] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10BTullis) This confirms all of the package install dates: ` btullis@cumin1001:~$ sudo cumin 'cp1087.* or cp4021.* or cp403[3-6].*' 'zgrep...
[11:55:21] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10BTullis) I make it the following number of days without data: cp1087 = 236 days cp4021 = 105 days cp4033 = 84 days cp4044 = 83 days cp4...
[13:50:25] sorry all, been bad at communicating about ops week this week. Quiet until yesterday but a lot of outstanding stuff now. I'll look at it later today, but will try asap to see what's up with webrequest
[13:58:37] ah, nvm, just a restart. K, all looks good except refine immediate and delayed for the new schemas. I feel like that still hasn't quite figured out how to deal with the bootstrap case where there's no data, so we get alerts.
[14:01:39] sorry for not sending an email on alerts about my restarts yesterday milimetric :S
[14:02:08] no problem at all, I thought I missed a failed webrequest and I was like ono!
[14:02:41] if anybody wants to poke around OpenMetadata, it's up on localhost:8585 via tunnel: ssh -N an-test-client1001.eqiad.wmnet -L 8585:127.0.0.1:8585
[14:02:58] whenever I get another minute I'll start connecting it to the metastore
[14:12:19] (03CR) 10Ottomata: "Ah, this was merged, but eventgate-main doesn't know about it yet, and our produce canary events job uses the latest schema to produce can" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737429 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[14:20:28] Has anyone got any guidance on this question please? https://phabricator.wikimedia.org/T295733#7637646
[14:21:23] It's a request from product-analytics. They are using the `kerberos::systemd_timer` type in puppet and trying to capture a log file.
[14:22:20] The log file doesn't capture anything because rsyslog is looking for a `programname` of `product-analytics-movement-metrics`
[14:23:01] Instead, the only output from the unit file has the programname `kerberos-run-command`.
[14:24:02] I'm trying to work out the best advice to give, given that their task involves launching a bash script (currently named main.sh) which then runs a number of jupyter notebooks.
[14:26:27] btullis: is there another rsyslog selector we could use?
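(Aside, illustrating the mismatch btullis describes above: journald tags the timer's output with the wrapper's syslog identifier rather than the unit title, and rsyslog's `programname` match, and therefore the log-file rule, keys off that identifier. A minimal sketch using the python-systemd bindings, assuming the `systemd` Python package is available on the host; the unit name is taken from the discussion.)

# Sketch, not a fix: print which SYSLOG_IDENTIFIER (rsyslog's `programname`)
# journald actually records for the timer's service unit. For a
# kerberos::systemd_timer this shows up as "kerberos-run-command", which is why
# a rule matching programname == product-analytics-movement-metrics captures nothing.
from systemd import journal

reader = journal.Reader()
reader.add_match(_SYSTEMD_UNIT="product-analytics-movement-metrics.service")
for entry in reader:
    print(entry.get("SYSLOG_IDENTIFIER"), "->", entry.get("MESSAGE"))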
[14:27:05] or, they could make their job write the output to a file, or use python logging?
[14:27:21] I think it just inherits `programname=$title` from the systemd::timer::job
[14:29:02] right, but we could change that
[14:29:13] if via kerberos::systemd_timer
[14:29:18] add some params to systemd::timer::job
[14:35:08] Having trouble identifying a suitable selector anyway: https://www.rsyslog.com/doc/v8-stable/configuration/properties.html#message-properties
[14:35:51] I thought I could give them sudo access to run `sudo journalctl -u product-analytics-movement-metrics` but that's not a great solution.
[14:36:46] Probably advising them to log to a file is better in the long run.
[14:37:38] oh, they can do that, if they already have sudo to that user and it works
[14:37:41] that's totally fine
[14:39:02] I'm not sure that it does work yet. I know that Maya couldn't do it from her own user account, but I didn't think about checking whether the shared system user account can run that command.
[14:39:13] * joal is sad of seeing notebooks run as production-style jobs :(
[14:39:18] elukey: david is getting an image bump up for eventgate-wikimedia, after that i will deploy eventgate-main and eventgate-analytics, ya?
[14:39:33] yeah, i think maybe it doesn't work btullis , not sure though
[14:39:56] maybe there is some way to increase the perms on a system unit's journal logs? dunno
[14:40:49] joal: Should we advise a different approach? Here are the sources for the notebooks. https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/wmf-product/jobs/+/refs/heads/master/movement_metrics/
[14:41:08] ottomata: <3
[14:41:29] ottomata: I'll also prep a change for eventstreams
[14:41:43] joal: They're deployed via git/puppet: https://github.com/wikimedia/puppet/commit/92b4fd9197f1e4e16f758ad81375290b0c683b5c
[14:41:54] okaY!
[14:42:17] joal: I'm just not sure how we can advise them to do it in a better way.
[14:42:18] ottomata: given how much traffic jumbo handles (definitely more than all clusters) I am inclined to focus on kafka logging eqiad/codfw first, and then move to the other clusters (for the PKI move)
[14:42:34] but the sooner we move clients the better of course :)
[14:44:09] elukey: errrr is that not what we are doing now?
[14:45:00] ottomata: yes yes I meant that I'll try to move jumbo and main after logging just as a precaution
[14:45:05] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:45:30] (on the broker side, the TLS certs I meant)
[14:45:38] OH oh
[14:45:42] k
[14:46:08] (I am moving all rsyslog clients to the new bundle as we speak, so nice)
[14:46:36] btullis: I think we'll advise after Airflow
[14:48:40] joal: OK. Do you mean once we're happy with our Airflow deployment, we will have a new system for them to use instead of running notebooks in production? Until then, muddle through?
[14:49:56] joal: i'm with you...but if they use git to deploy their notebooks, i'm not sure how much I can actually complain
[14:50:09] the only real difference is debuggability, you can't really read the code without a web UI
[14:50:19] but, you also can't read the code of a .jar file or a compiled binary
[14:50:27] so The only thing we'd have to be aware of about deps is if someone has a pyspark that lives inside a conda env. E.g. if the pyspark job file is at conda_env/bin/mysparkjob.py. In that case, mysparkjob.py needs to exist wherever the spark-submit command is run. We were trying to work around this with that `call.py` thing, but it was very hacky and I'm not so sure it always worked right.
[14:50:29] oops
[14:50:31] WRONG PASTE
[14:50:37] i meant to paste
[14:50:39] so ¯\_(ツ)_/¯
[14:50:51] btullis: I think that's right - the problem is also broader than just notebooks, it's about common definition of metrics, what production means for data, and what tools are suited for that
[14:51:12] ottomata: yeah, I hear your point
[14:51:37] I'm gonna stop complaining I guess :)
[14:52:45] i think we can complain if we have a good reason :)
[14:52:58] but i can't think of one other than curmudgeonliness :)
[14:54:05] ottomata: fait point :)
[14:54:17] s/t/r
[14:54:21] :)
[14:55:10] (03PS5) 10Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[14:56:10] For right now, they just need to get the existing script running with a different process title, so what about if I just did a
[14:56:11] `command => exec -a product-analytics-movement-metrics ${jobs_dir}/movement_metrics/main.sh`
[14:56:11] here: https://github.com/wikimedia/puppet/blob/production/modules/statistics/manifests/product_analytics.pp#L43
[14:56:24] Worth a go?
[14:57:55] (03PS13) 10Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[14:58:45] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:59:03] Gone for kids folks - back at standup
[15:00:16] huh, btullis cool idea
[15:00:19] why not, give it a go
[15:00:46] Will do. Also gone for kids, back at 16:00 UTC.
[15:02:37] (03CR) 10jerkins-bot: [V: 04-1] Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[15:06:34] Giving it a go: https://gerrit.wikimedia.org/r/c/operations/puppet/+/757668 - I've got a meeting with Maya at 16:15 so if pcc is OK, then I'll try to deploy it and check to see if it behaves as expected.
[15:25:03] eventstreams patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/757672 :)
[15:31:56] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:37:18] merged elukey
[15:39:46] thanks!
[15:44:32] ok elukey deploying eventgate-analytics
[15:44:35] then we can do eventstreams
[15:51:35] ottomata: only if you have time! Otherwise another day
[15:52:29] now is the time :)
[15:55:51] 10Analytics, 10Analytics-Kanban, 10Pageviews-Anomaly: Article on Carles Puigdemont has inflated pageviews in many projects - https://phabricator.wikimedia.org/T263908 (10TheDJ)
[16:16:47] heya aqu1 :] I reviewed your Airflow changes. The code looks great! I left some comments, mostly questions, since the way we organize code now will inform how we write other jobs. Thanks! https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/8
[17:04:51] (03PS14) 10Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[17:05:19] (03CR) 10Phuedx: [WIP] Metrics Platform event schema (034 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[17:14:22] 10Data-Engineering: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10Snwachukwu)
[17:18:54] 10Data-Engineering: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10Snwachukwu) a:03Snwachukwu
[17:23:04] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Milimetric) This lines up with the dip in traffic that @kzimmerman was asking us about. I'm gonna loop her in here sooner than later, e...
[17:50:12] milimetric: o/ what do you mean by dip in traffic?
[17:50:38] the vk instances lost data for months, was it something prolonged in time?
[17:52:03] yes elukey - trying to find a task
[17:52:15] lovely
[17:52:40] milimetric: when do you wish we do the loss-analysis? the sooner the better I guess?
[17:54:47] elukey: yeah, basically Product Analytics noticed something and told us about it in December. They said it started happening sometime in October / November, so that would line up with most of the times that Ben found
[17:55:17] milimetric: yeah it lines up with the theory of OS reimage of caching nodes
[17:55:58] joal: sooner the better, yeah, when do you have time?
[17:56:53] milimetric: in ~1h - is that good for you?
[17:57:18] sounds good joal, who else was going to work on it? Sandra & Antoine?
[17:57:45] if they can, yes milimetric - let's see if it's ok for them (late)
[17:58:55] yep, I just made a thread in slack (I think we're done with IRC :P)
[17:59:16] mforns: got time for me? airflow batcave?
[18:01:16] milimetric: in a meeting, ping you when done!
[18:18:05] milimetric: done, wanna cave?
[18:18:19] omw there, cc ottomata
[18:18:32] k
[18:20:38] milimetric: mforns will join if you need me, or shortly, was working on an eventstreams deploy for elukey and it failed partway
[18:35:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10JAllemandou)
[18:39:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Low Risk Oozie Migration: Mediawiki Geoeditors Monthly - https://phabricator.wikimedia.org/T300282 (10JAllemandou)
[18:47:56] elukey: eventstreams done!
[18:51:59] milimetric: ready when you wish
[19:45:56] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Milimetric) >>! In T300164#7656072, @BTullis wrote: > I make it the following number of days without data: > > cp1087 = 236 days > cp40...
[19:55:52] 10Data-Engineering, 10Data-Engineering-Kanban: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (10Antoine_Quhen)
[21:07:09] 10Analytics, 10Analytics-Kanban, 10Data-Engineering-Kanban, 10Event-Platform, and 5 others: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10nshahquinn-wmf)
[21:07:48] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Analytics, 10Privacy: Capture rev_is_revert event data in a stream different than mediawiki.revision-create - https://phabricator.wikimedia.org/T280538 (10nshahquinn-wmf) 05Open→03Resolved There is now a revision tag to identify reverts...
[21:15:49] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Milimetric) Queries used for analysis: {F34933480}
[21:19:25] Gone for tonight
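(One last aside, on T300299 above: the setting in question is Spark's `spark.sql.files.maxPartitionBytes`, which caps how much file data goes into a single input partition and defaults to 128 MB. A minimal PySpark sketch of bumping it to 256 MB; the app name and input path are placeholders.)

# Minimal sketch: raise Spark's per-partition read size from the 128 MB default
# to 256 MB, which roughly halves the number of input partitions/tasks when
# reading large file-based datasets.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("max-partition-bytes-example")                # placeholder name
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.parquet("/wmf/data/example")               # placeholder path
print("input partitions:", df.rdd.getNumPartitions())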