[01:08:26] 10Data-Engineering, 10Anti-Harassment, 10Product-Analytics: Mediawiki history has no data on IP blocks - https://phabricator.wikimedia.org/T211627 (10nettrom_WMF) Moving this to the Data Engineering board because I think having block data for both registered and IPs in `mediawiki_user_history` (or maybe in a...
[01:09:22] 10Data-Engineering, 10Anti-Harassment, 10Product-Analytics: Distinguish between types of block events in the Mediawiki user history table - https://phabricator.wikimedia.org/T213583 (10nettrom_WMF) Moving this to the Data Engineering board because I think having this type of data on blocks in the Mediawiki U...
[08:01:36] 10Data-Engineering, 10SRE, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10elukey)
[08:32:37] * addshore waits for his event table to appear in hive
[08:48:00] (03CR) 10Joal: [V: 03+2 C: 03+2] "LGTM! Merging for next deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757035 (https://phabricator.wikimedia.org/T293406) (owner: 10MNeisler)
[09:37:29] elukey: Would now be a good time to deploy and test https://gerrit.wikimedia.org/r/c/operations/puppet/+/742747 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/755435 ?
[09:37:36] Are you around if I get stuck?
[09:43:06] Plan will be along the lines of:
[09:43:06] * Notify traffic that I'm about to stop puppet and test these changes on cp3050
[09:43:06] * Stop puppet on all cp* hosts
[09:43:06] * Merge 742747 and 755435
[09:43:06] * Run puppet on cp3050 and check for correct configuration of /etc/varnishkafka/webrequest.conf
[09:43:07] * Check the status of varnishkafka-webrequest.service which should have restarted automatically.
[09:43:29] * Restart puppet on the fleet of cp* nodes once all is well.
[09:43:42] Have I missed anything?
[09:46:26] btullis: o/
[09:46:30] sorry just seen the message
[09:46:36] I am available :)
[09:46:44] I only just typed it :-)
[09:47:15] the plan is ok, remember that we run multiple vk instances
[09:47:34] on text nodes, we have vk-webrequest, vk-statsd, vk-eventlogging
[09:48:02] all three will be refreshed
[09:48:17] you can see them in https://grafana.wikimedia.org/d/000000253/varnishkafka
[09:48:36] the bundle-ca change will affect all three of them
[09:48:48] OK, so all three will be updated to use the new PKI. Only webrequest gets the cipher change, right?
[09:48:48] meanwhile the format one only the webrequest
[09:48:52] exactly
[09:49:46] to test the change in the message sent to kafka it is sufficient to use kafkacat on a stat100x node and filter for the host updated (like 3050)
[09:50:01] to test the pki change it should be enough to observe metrics and vk logs
[09:50:12] and also using kafkacat
[09:50:24] if there is a problem no messages will be displayed
[09:50:40] OK, thanks. I hadn't thought of that.
[10:00:25] I'm confused why this isn't returning anything.
[10:00:29] https://www.irccloud.com/pastebin/NQU5PQfJ/
[10:02:10] Ah, `webrequest_text`
[10:04:21] to select a single node (for later on)
[10:04:22] kafkacat -C -t webrequest_text -b kafka-jumbo1001.eqiad.wmnet | jq '. | select(.hostname == "cp3050.esams.wmnet")'
[10:04:27] very handy
[10:05:08] Nice, I had a grep, but I see that this was picking up peer cache hits.
[10:05:10] at some point in the future we'll have to enforce TLS for all kafka clients, it would be great
[10:05:20] Yep yep.
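A minimal sketch of the per-host checks in the plan above, run on the canary cp host after the merge. It assumes `run-puppet-agent` is the usual wrapper available on production hosts (a plain puppet agent run works otherwise), and the grep pattern is only illustrative:

    sudo run-puppet-agent                                   # apply the merged varnishkafka changes on cp3050
    sudo systemctl status varnishkafka-webrequest.service   # should have restarted cleanly on config change
    grep -E 'ssl|ca\.location' /etc/varnishkafka/webrequest.conf   # confirm the new CA bundle / TLS settings landed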
[10:05:27] (too many things sigh)
[10:07:27] !log btullis@cumin1001:~$ sudo cumin 'O:cache::upload or O:cache::text' 'disable-puppet btullis-T296064-T299401'
[10:07:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:07:31] T299401: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401
[10:07:31] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[10:12:58] Merged the two changes.
[10:14:42] super
[10:15:30] Did we do a rolling-restart of kafka-main within the past 6 days? varnishkafka-statsv.service is showing kafka disconnects in its status output, although it's running fine now.
[10:16:45] not that I know of, I upgraded the kafka-main os to buster but it was earlier than that
[10:17:48] Not to worry, just interested. Puppet applied. All three varnishkafka instances restarted and running.
[10:19:39] on cp3050?
[10:19:55] (I just fixed the instance selection in the grafana dashboard, it wasn't working)
[10:20:30] Yes, only on cp3050. I also see `ch_ua`,`ch_ua_mobile`,`ch_ua_platform` in the events. Although I had to add the `-o end` offset to kafkacat first.
[10:20:50] perfect
[10:20:58] let's do it on another node as well
[10:21:07] just to be sure, 3050 had some changes already applied IIRC
[10:22:24] I see the new fields as well, the values are getting in
[10:22:39] there seems to be a weird double quoting
[10:22:40] "ch_ua_platform": "\"Android\""
[10:23:27] Yes, I see what you mean.
[10:25:28] o/
[10:25:30] o/
[10:25:36] 10Analytics, 10Wikidata, 10Wikidata Analytics: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier) - https://phabricator.wikimedia.org/T299358 (10Lucas_Werkmeister_WMDE) That seems to have fixed the logs; using the same `zgrep` pipeline from the task description (now...
[10:25:53] phuedx: Ben deployed the change to one caching node, all good but we see stuff like
[10:26:01] "ch_ua_platform": "\"Windows\""
[10:26:11] is the double quoting expected? I mean, part of the specs
[10:26:35] (one set of quotes is added by the json message of varnishkafka, but the second pair seems to be the value of the header)
[10:27:09] btullis: something interesting is that cp3050 increased its cpu usage
[10:27:51] weird
[10:27:59] elukey: Yes. The Sec-CH-UA header will contain quoted values, e.g. Sec-CH-UA: "<brand>";v="<significant version>", ...
[10:28:13] phuedx: perfect thanks for confirming, all good then :)
[10:28:18] Other examples are given here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-CH-UA#examples
[10:31:45] elukey: I'm not sure that I understand what the axes represent on this graph: https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPanel=42&orgId=1&var-datasource=esams%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=cp3050
[10:32:37] btullis: it should be the cpu time spent in the cgroup that runs the varnishkafka instance
[10:32:38] I see what you mean about an uptick, but if you compare it to the host CPU overall (https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cp3050&var-datasource=thanos&var-cluster=cache_text&from=now-1h&to=now) I can't make out what the effect represents.
[10:32:55] Ah, OK. Good to know.
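Building on the kafkacat/jq pattern already shown above, one way to spot-check the new client-hint fields from a stat host; broker, topic, and hostname are the ones used in the log, and the quoted values (e.g. "\"Android\"") are expected because the Sec-CH-UA header itself carries quoted tokens:

    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t webrequest_text -o end \
      | jq -c 'select(.hostname == "cp3050.esams.wmnet")
               | {ch_ua, ch_ua_mobile, ch_ua_platform}'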
[10:33:03] nothing really, and the other two instances that you restarted are ok
[10:34:12] maybe it is a temporary weirdness
[10:34:15] Yeah, looks like it is dropping as well, so perhaps the restarts just caused it to spike. I'll wait a few more minutes to see if it drops back to what it was, before re-enabling puppet on all cp-* nodes.
[10:34:50] even if it stays at that level it should be fine, it is comparable with other instances
[10:39:13] OK, re-enabling puppet now.
[10:39:38] it will be interesting to observe if other instances raise their cpu-usage as well
[10:39:49] and if the tls latencies to the brokers will stay the same etc..
[10:40:24] Yeah. Interestingly, the eventlogging one seems to have reduced its CPU usage slightly, but this varies more over time anyway. https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPanel=42&orgId=1&var-datasource=esams%20prometheus%2Fops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3050
[10:41:16] btullis: congrats for your first vk deployment :)
[10:41:48] 10Analytics, 10Wikidata, 10Wikidata Analytics: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier) - https://phabricator.wikimedia.org/T299358 (10Lucas_Werkmeister_WMDE) 05Open→03Resolved a:03Lucas_Werkmeister_WMDE The file size also went back up starting from t...
[10:42:07] Thanks for being patient and super-helpful.
[10:43:59] super happy to help, the more people know how these things work the better :)
[10:58:14] ^
[10:58:45] Congratulations btullis!
[10:59:10] Thank you. :-)
[11:00:19] 10Analytics-Radar, 10Data-Engineering-Radar, 10Event-Platform, 10Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (10elukey) The last clients to move should be eventstreams and eventgate! Next steps: - deploy https://gerrit.wikimedia.org/r/7534...
[12:48:54] FYI, I'm rolling out apache security updates on the various analytics web UIs in the next 1-2 minutes, there might be brief (and unavoidable) blips
[12:50:08] those are done now
[12:55:15] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:07:33] moritzm: many thanks
[13:47:57] (03PS1) 10Joal: Add user-agent client hints to webrequest tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757445 (https://phabricator.wikimedia.org/T299402)
[13:48:43] mforns, milimetric, btullis - I just sent --^ after the update of varnishkafka for the CH-UA headers
[13:48:57] I'll deploy today if you give me + :)
[13:51:25] (03CR) 10Btullis: [C: 03+1] "Looks good to me. :-)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757445 (https://phabricator.wikimedia.org/T299402) (owner: 10Joal)
[14:02:05] thanks btullis for the quick review - I'm gonna wait a second +1 and that'll make +2 :)
[14:17:06] (03CR) 10Milimetric: [C: 03+2] Add user-agent client hints to webrequest tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757445 (https://phabricator.wikimedia.org/T299402) (owner: 10Joal)
[14:30:01] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:36:08] ottomata: o/ hiii since you are in eventgate-land, do you have time for the ca-bundle deployment ?
[14:36:21] les do it
[14:36:22] (checking that staging looks good etc..)
[14:36:29] oh i should wait for my restarts then.
[14:36:36] ah whatevs i'll finish eventgate-analytics and do it again
[14:36:45] elukey: did we merge already?
[14:37:19] ottomata: nope just done :(
[14:37:28] kay
[14:37:39] finishing my roll restart, we can apply changes after
[14:37:45] super
[14:38:37] you shouldn't get the new changes up to the next puppet run but check with helmfile diff before syncing just in case
[14:39:10] done, okay, also...i have meetings starting in 20 mins
[14:39:17] can i start with you, and if needed you finish?
[14:39:26] sure sure
[14:39:34] ok
[14:39:41] i'm looking at eventgate-logging-external staging diff
[14:39:43] i see the change
[14:39:49] ok to deploy there in staging?
[14:39:52] (thanks for the time and patience)
[14:39:53] + ssl.ca.location: /etc/ssl/certs/wmf-ca-certificates.crt
[14:39:55] +1
[14:40:01] ah wait a sec
[14:40:10] there should also be an image change
[14:40:21] oh yes
[14:40:22] image change too
[14:40:30] + image: "docker-registry.wikimedia.org/wikimedia/eventgate-wikimedia:2022-01-11-142353-production"
[14:40:33] elukey: do we have a ticket
[14:40:37] want to attach it to log message
[14:40:44] ?
[14:40:52] lemme check
[14:41:00] T296064
[14:41:01] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[14:42:37] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:43:51] I think that --^ failed due to the deploy
[14:46:03] maybe, i think separate issue, is the reason i was restarting eventgate-analytics
[14:46:14] yep yep
[14:46:17] I've restarted it
[14:46:23] ottomata: o/
[14:46:31] o/
[14:46:41] if you have a minute, could you +1 that small patch of me so that I could deploy please?
[14:47:03] Mwarf ottomata - didn't notice milimetric already did - my bad
[14:47:21] (03CR) 10Ottomata: [C: 03+1] Add user-agent client hints to webrequest tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757445 (https://phabricator.wikimedia.org/T299402) (owner: 10Joal)
[14:47:30] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/757445 (https://phabricator.wikimedia.org/T299402) (owner: 10Joal)
[14:47:43] elukey: eg logging external staging looks good
[14:47:52] ok if i continue to codfw and eqiad?
[14:48:19] a-team: I'm gonna deploy refinery now - anything you'd like me to add to the current 2 things (webrequest CH-UA and edit-hourly update)
[14:48:34] Nothing from me, thanks.
[14:48:55] ottomata: +1
[14:49:34] sorry I had no time to deploy, jo
[14:49:47] np milimetric - it's actually good we waited for today :)
[14:50:18] 10Data-Engineering, 10SRE, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10Ottomata) > A better and more stable Kafka Mirror Maker (even if after all the work that Andrew did we have something very stable as well now) This really does look great, and has s...
[14:50:28] train is on the way!
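For reference, the eventgate ca-bundle rollout discussed above follows roughly this flow; a sketch only, where the deployment-charts path and service name are assumptions and the exact flags may differ on the deployment hosts:

    cd /srv/deployment-charts/helmfile.d/services/eventgate-logging-external   # path is an assumption
    helmfile -e staging diff    # expect only the image bump and the ssl.ca.location change
    helmfile -e staging apply   # then repeat for codfw and eqiad once staging looks good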
[14:52:25] !log Deploy refinery with scap
[14:52:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:52:40] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:54:15] elukey: since i have your attention! :)
[14:54:17] https://phabricator.wikimedia.org/T296543#7647470
[14:54:41] summary: i want to make skein start a yarn AppMaster, that will run spark-submit
[14:54:50] airflow will be doing this, so it will need to be able to use a keytab
[14:55:16] from what I can tell, it should be ok to use yarn localresources to upload the keytab to the yarn appmaster
[14:55:47] (eg logging-external done and looks good, moving on to eventgate-analytics-external)
[14:56:01] !log elukey@cp4035:~$ sudo systemctl restart varnishkafka-webrequest.service - metrics showing messages stuck for a poll()
[14:56:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:56:47] ottomata: yes it should be good, everything is encrypted and authenticated
[14:57:04] and it should be a temporary per-user cache so I think it is ok
[14:57:08] wow that is the answer I wanted to hear! joal ^ :)
[14:57:12] yeah it's a per app cache
[14:57:24] but it is not shared by users right?
[14:57:26] no
[14:57:39] i mean, not even a same user in a different app could access it
[14:57:50] exactly yes this was my understanding
[14:58:00] when spark uploads the keytab it uses the same IIRC
[14:58:10] ya
[14:58:14] that's my understanding as well, but I'm always concerned when moving secrets around
[14:58:26] ok all good then - let's do it :)
[14:59:17] 10Data-Engineering-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10Ottomata) Beautiful words from Luca re uploading keytab: > ottomata: yes it should be good, everything is encrypted and authenticated :)
[15:00:15] okay elukey eg analytics done.
[15:00:22] i gotta go to meeting
[15:00:28] can you do eventgate-analytics and eventgate-main?
[15:02:36] 10Data-Engineering-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10elukey) >>! In T296543#7652986, @Ottomata wrote: > Beautiful words from Luca re uploading keytab: >> ottomata: yes it should be good, everything is encrypted and authenticated > > :)...
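A rough sketch of the skein idea discussed above: the keytab is shipped to the YARN AppMaster as a per-application local resource, and the AppMaster script then runs spark-submit against it. The spec layout follows skein's documented YAML format as best understood here; the file names, principal, realm, and paths are hypothetical:

    cat > spec.yaml <<'EOF'
    name: airflow-spark-job
    master:
      files:
        airflow.keytab: /path/to/airflow.keytab   # localized only for this application (APPLICATION visibility)
      script: |
        spark-submit --master yarn --deploy-mode cluster \
          --principal airflow@EXAMPLE.REALM --keytab airflow.keytab \
          my_job.py
    EOF
    skein driver start                    # start the skein driver if one is not already running
    skein application submit spec.yaml    # launches the AppMaster, which runs the script above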
[15:04:16] ottomata: we can do it later on or tomorrow together, no rush, I'd be happier with you around :)
[15:06:06] !log elukey@cp4035:~$ sudo systemctl restart varnishkafka-eventlogging.service - metrics showing messages stuck for a poll()
[15:06:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:16] I don't like this
[15:08:16] but it seems limited to a couple of hosts in ulsfo
[15:08:55] in the logs for all instances I see
[15:08:55] KAFKAERR: Kafka error (-195): ssl://kafka-main1004.eqiad.wmnet:9093/1004: Disconnected (after 1199322ms in state UP)
[15:10:46] !log elukey@cp4036:~$ sudo systemctl restart varnishkafka-statsv
[15:10:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:10:50] !log elukey@cp4036:~$ sudo systemctl restart varnishkafka-eventlogging
[15:10:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:19:56] mmmm cp403[5,6] show no traffic
[15:20:02] they are maybe new/special nodes
[15:23:05] ok elukey lets do later
[15:27:01] !log Deploy refinery to HDFS
[15:27:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:40:24] !log Kill-restart edit-hourly oozie job after deploy
[15:40:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:40:51] so folks on some ulsfo caching nodes we have been dropping vk messages for a while :(
[15:40:58] even before today's changes
[15:41:14] vk was failing to get the shm handle of varnish
[15:41:31] hm - we have almost not seen errors of missing data elukey - one yesterday is all I recall in the past weeks
[15:42:47] joal: I hope to be wrong, I'll open a task
[15:44:19] !log Kill-restart webrequest oozie job after deploy
[15:44:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:46:27] joal: i think if a vk node is not sending any data, we wouldn't have any data loss info aout it
[15:46:29] about *
[15:47:16] ottomata: if we're missing full vk hosts, yes - if a host misses some messages is what I was thinking
[15:49:49] joal: I think that due to some weird cp node statuses we have dropped traffic from a lot of cp nodes completely
[15:50:09] cp3045 for example
[15:50:16] Valentin just restarted it with the correct settings
[15:50:18] okey :S do we have an idea from how long elukey?
[15:50:54] joal: see https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-3h&to=now&var-datasource=ulsfo%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All
[15:51:10] still not a precise idea, the last varnishkafka upgrade was in late 2020..
[15:52:58] ack elukey
[15:54:06] !log Add new CH-UA fields to wmf_raw.webrequest and wmf.webrequest
[15:54:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:01:50] ok traffic restored on a lot of nodes
[16:03:03] the problem was cross-instance, so not all webrequest-related
[16:03:14] Going to make a post-mortem so we can understand the damage
[16:07:27] 10Data-Engineering, 10Data-Engineering-Kanban, 10Metrics-Platform: Add user agent client hints to the `webrequest` table - https://phabricator.wikimedia.org/T299402 (10JAllemandou)
[16:08:01] thank you very much elukey
[16:13:21] elukey: is it just those 2 nodes?
[16:15:15] even so...that is 1% of webrequest_text? is that right?
[16:15:29] about 1000 reqs/sec handled by those 2 nodes?
[16:16:19] webrequest_text around 100K / second?
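One quick way to see which cp hosts are (or are not) producing, in the spirit of the checks above: sample the live webrequest topic and count messages per hostname; hosts that never show up in the sample are the suspects. The sample size here is arbitrary:

    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t webrequest_text -o end -c 200000 \
      | jq -r .hostname | sort | uniq -c | sort -rn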
[16:17:02] the affected nodes are
[16:17:03] cp1087.eqiad.wmnet,cp[4021,4033-4034,4036].ulsfo.wmnet
[16:17:08] and cp4035
[16:18:11] looking at chart it seems cp1087.eqiad.wmnet has not come back sending data
[16:18:12] and elukey, we lost the request logs then, right? or was varnish not serving any reqs?
[16:18:22] the former yes
[16:18:41] nasty
[16:18:48] yeah we should quantify
[16:18:53] joal: checking yes
[16:18:58] that might be significant
[16:19:06] not all webrequest, to be clear
[16:19:11] various instances
[16:19:26] yes, but all webrequests served by those instances, right?
[16:19:32] 3 hosts in upload from ulsfo and 2 in text from ulsfo
[16:19:44] ok upload slightly less important
[16:20:00] nono I need to verify, on some node it may have hit say only eventlogging/statsv
[16:20:04] oh
[16:20:07] k
[16:20:11] and the one from eqiad, but I don't yet have it back to check
[16:22:17] joal: should be working now
[16:22:22] I see logs from kafkacat
[16:25:09] so for webrequest, I'd say 3 vk instances: cp403[56] and cp1087
[16:25:11] 3 text nodes
[16:25:47] the rest *should* be related to eventlogging and statsv
[16:28:34] https://phabricator.wikimedia.org/T290694
[16:29:00] so this is re-assuring - my theory is that these hosts are new nodes (like pooled during the past couple of months)
[16:29:11] and for some reason, they came up with the wrong version of vk
[16:29:16] and never sending data
[16:31:48] again elukey - thank you <3
[16:52:56] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey)
[16:53:14] joal, ottomata - draft of the post-mortem in --^
[17:01:02] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10Ottomata) > varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text) > varnishkafka-webrequest on cp4021, cp4033, cp4034 (upload) We...
[17:33:14] folks are we doing the sre sync?
[17:33:30] elukey: we're in planning session, you'll be alone :S
[17:33:32] Sorry. we're in quarterly planning.
[17:33:41] ahh okok np
[17:47:16] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey) The main issue is: ` varnishkafka | 1.0.14-1 | buster-wikimedia | main | amd64, source varnishkafka | 1.1.0-1 |...
[17:49:27] (03PS11) 10Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[17:51:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Metrics Platform event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[17:53:40] 10Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10elukey) >>! In T300164#7653547, @Ottomata wrote: >> varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text) >> varnishkafka-webreque...
[18:02:38] ottomata: any idea how long after receiving events I should see my table appear in the event db in hadoop?
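Given the root cause recorded in the post-mortem above (an old varnishkafka package still installed on a few cp hosts), a fleet-wide version check with cumin would surface the outliers, since cumin groups hosts by identical command output; the host selector reuses the one from earlier in this log:

    sudo cumin 'O:cache::upload or O:cache::text' 'dpkg -l varnishkafka | grep "^ii"'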
[18:02:45] or did I miss another step :P
[18:09:39] addshore: if there was no error on validation etc, I think it's something like a few hours (less than 6 for sure)
[18:10:15] ahh, maybe i should check the validation logs ;)
[18:20:25] (03PS12) 10Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan)
[18:26:41] (03PS1) 10Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[18:30:51] (03CR) 10jerkins-bot: [V: 04-1] Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[18:32:38] addshore: yea, usually 2ish
[18:32:51] u got no table? lets see
[18:32:53] are the events in kafka?
[18:34:28] yes i see some
[18:34:54] addshore: you have a table
[18:34:57] event.mwcli_command_execute
[18:35:15] *looks again*
[18:36:25] nice, the table list just went from 253 to 255 for me :D
[18:38:15] (03CR) 10Ottomata: Integrate SparkSQLNoCLIDriver and HiveToCassandra (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[18:38:19] (03CR) 10Ottomata: [C: 03+1] Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: 10Joal)
[18:38:36] :)
[19:56:07] (03PS2) 10Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[21:11:15] (03PS28) 10AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)
[22:10:13] (03CR) 10Ebernhardson: [C: 03+2] rdf-streaming-updater: add a "reconcile" operation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737429 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[22:10:40] (03CR) 10Ebernhardson: [C: 03+2] rdf_streaming_updater: add a reconcile event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/756536 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[22:11:32] (03Merged) 10jenkins-bot: rdf-streaming-updater: add a "reconcile" operation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/737429 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[22:11:39] (03Merged) 10jenkins-bot: rdf_streaming_updater: add a reconcile event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/756536 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse)
[22:33:53] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:45:15] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:58:59] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
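To confirm that Refine has picked up a new stream (per the "a few hours, usually 2ish" estimate above), one can check both the topic and the resulting Hive partitions from a stat host. This is only a sketch: the topic name below is a guess derived from the table name, and the beeline call assumes the usual Kerberos setup (kinit first):

    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eqiad.mwcli.command_execute -o beginning -e | wc -l   # topic name is a guess
    beeline -e 'SHOW PARTITIONS event.mwcli_command_execute;'   # hourly partitions appear once Refine has run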