[08:54:26] 10serviceops, 10Observability-Logging, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (10JMeybohm) I tried to convince rsyslog to retry/resume the action with: ` action(type=...
[09:21:22] akosiaris: o/
[09:21:49] hi
[09:22:08] if you have a moment I'd have a question about ORES and Redis
[09:22:12] * elukey sees Alex running away
[09:22:30] lol
[09:22:42] ask, I reserve the right to not answer :P
[09:22:48] ahahhaha
[09:23:01] anything I say can and will be used against me in a court of SREs
[09:24:04] so Effie and I were wondering how rdb reboots were handled in the past - I see two options: 1) just reboot and, in case celery complains afterwards, restart it 2) flip the rdbXXXX references in ORES' config to its replica (but then there is the question about what to do with redis replication itself etc..)
[09:24:11] have you done it in the past?
[09:24:18] just to have some datapoints
[09:25:38] given what 2) involves I'd be a little hesitant in doing it
[09:26:16] meanwhile the docs, no idea how up to date, claim that we can just stop celery
[09:26:21] reboot, and then start it again
[09:27:01] so in the past we've tried a variety of approaches. We used to have a DNS record oresrdb.svc.eqiad.wmnet that would be a CNAME to one of the 2 servers. Flip it in DNS and either wait it out or do a rolling restart if we wanted it coordinated. It turned out we usually wanted it coordinated, as not having it coordinated caused mayhem a couple of times: we would end up with the uwsgi workers putting jobs in the old server and no celery workers serving those jobs.
[09:27:56] Since it was a gerrit change anyway, we dropped the DNS and moved to hardcoding the currently active server. And then ORES went into maintenance mode and IIRC last time we just rebooted the rdb server and said let's take the heat for a few minutes
[09:29:11] but your 2) would be the more polite way of doing it.
[09:29:36] for what it's worth, the internal stuff (the mw extension) is going to retry anyway
[09:29:46] but 2) involves also flipping the redis master/replica status right?
[09:29:52] no, not really
[09:30:02] assuming you do the following it's not needed
[09:30:17] ah since they don't really need to be in sync
[09:30:27] cumin -b 1 'ores*' 'systemctl restart celery-ores-worker ; systemctl restart ores-uwsgi'
[09:30:40] so what happens is that one by one the nodes switch to the other server
[09:30:46] yes yes
[09:31:05] so they'll put jobs into a new celery redis queue
[09:31:07] the cache is... well, a cache; it's nice if it's up to date, but even if it's not it's just increased latencies
[09:31:17] and the jobs will go to the new celery redis queue as you say
[09:31:31] what could possibly go wrong with ORES
[09:31:37] the key is to make sure there is a celery worker that is able to serve the jobs the uwsgi puts in the queue
[09:31:58] yeah, sad stories overall.
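A rough sketch of the "polite" option 2) discussed above, assuming the gerrit change pointing ORES at the other rdb host has already been merged and puppet has run; the cumin rolling restart is the command quoted above, everything else (confctl selector, hostnames) is a placeholder, not taken from the discussion:

    # depool ORES from discovery first (it may already be depooled);
    # the selector below is an assumption
    sudo confctl --object-type discovery select 'dnsdisc=ores,name=eqiad' set/pooled=false

    # roll through the ores hosts one at a time so every node has moved onto the
    # new celery redis queue before the next one is touched (quoted from above)
    sudo cumin -b 1 'ores*' 'systemctl restart celery-ores-worker ; systemctl restart ores-uwsgi'

    # the internal clients (the mw extension) retry, so the now-idle rdb host
    # can be rebooted without further coordination
    sudo cumin 'oresrdb1001*' 'reboot'   # hostname is a placeholder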
[09:32:27] One day we'll have Kubeflow and our life will be better ()
[09:32:39] also keep in mind that the queue, while replicated, IS NOT persisted in any of the rdb servers
[09:32:43] and that's by design
[09:33:00] so a restart of any rdb server will lose the entire queue, and that's usually ok
[09:33:18] again, mostly because the internal clients will just retry
[09:33:32] ok so wait
[09:33:43] if I am reading all that correctly
[09:33:58] external clients, namely researchers from what I gathered, either don't complain much, or don't know how to complain?
[09:34:22] we can indeed stop celery, reboot, and start it again after all?
[09:34:38] we are going to lose some data anyway it seems
[09:34:40] akosiaris: one last question (because of my ignorance) - what we call the 'score' cache, is it stored on the same redises?
[09:35:01] elukey: same hosts, different redis instance, different port
[09:35:09] that one is persisted
[09:35:18] akosiaris: aahhh good, perfect
[09:35:26] not even sure it should be tbh
[09:35:42] there were multiple discussions that it might not truly be worth it
[09:35:53] it is nice, otherwise after the reboot we'll start with a fresh cache if we flip back
[09:36:01] effie: not just celery, uwsgi as well
[09:36:11] ORES atm is highly dependent on it
[09:36:22] otherwise it returns scores after seconds
[09:36:24] (sigh)
[09:36:37] elukey: yeah, but it gets rather quickly populated.
[09:37:02] and it's usually stale from what I know. It's something like 15GB, but it's not like a big part of it is used
[09:37:09] or at least that's my intuition
[09:37:29] yes yes correct, since it gets populated after every edit, independently from the request of a score
[09:38:07] the funny part is that an EventStreams ores score "stream" depends on scoring all edits regardless of score requests
[09:38:16] the thing is, persisting it means a rather big chunk of time is being consumed on redis startup to load it
[09:38:43] true
[09:39:21] during that time scores are not cached or retrieved, returning errors. At the rate the cache gets populated it might have just been better to start with a cold cache
[09:39:27] are we overthinking this a bit?
[09:39:37] but no one is going to spend time and see if this is true now that it is in maint mode
[09:39:40] and that's good
[09:40:25] 10serviceops, 10SRE Observability (FY2021/2022-Q1), 10User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (10fgiunchedi) Ok I have a working patch to parse `omfwd` messages at https://phabricator.wikimedia.org/P17091, pending package and deployment
[09:40:50] effie: I am prepping the change to switch the rdb references, it will be good to perform the procedure anyway (for my knowledge)
[09:40:51] effie: niah, just sharing stories about the nuances I'd say, and the historical background. Overall I'd say: depool it from discovery (it's already depooled perhaps?) and just go for the rdb reboot.
[09:41:42] all the internal stuff is going to be retried, the external should not be reaching it anyway, so even the rdb switching is just being more polite.
[09:42:17] ok ok
[09:42:43] but feel free to be more polite, there's knowledge to gain in that as well.
[10:44:28] hi all, while looking at some issue I noticed many of the eqiad mw servers have ip_forwarding switched on (https://phabricator.wikimedia.org/T289679#7314517). is this intentional? (cc akosiaris)
[10:46:56] that's... hmmm weird
[10:48:52] jbond: could be a copy/pasta mistake from the past
[10:49:11] no, there is a docker0 interface on mw1319
[10:49:17] wot
[10:49:18] but docker is not installed
[10:49:55] oh dear, odd
[10:50:12] uahh that is strange, although that would explain it - if docker was installed and started at some point it would have toggled this
[10:50:22] maybe leftovers of the dragonfly perf testing (cc jayme)
[10:50:42] 58 mw servers have it, 255 mw servers don't
[10:50:50] the docker0 interface that is
[10:51:08] * jbond guessing a reboot will clear it
[10:51:34] daemon.log.3.gz:Aug 4 10:53:34 mw1319 dockerd[14581]: time="2021-08-04T10:53:34.330975526Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
[10:51:35] ok I will reboot 1319, it is in eqiad anyway
[10:51:38] Aug 4 I see
[10:51:40] don't
[10:51:50] it's my investigation testbed
[10:52:19] looking at SAL
[10:52:45] akosiaris: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709719
[10:53:10] https://sal.toolforge.org/log/-kzMEHsBa_6PSCT9cgQ5
[10:53:12] yup
[10:53:35] it's dragonfly
[10:53:54] ok 10:48 jayme: switch most eqiad appservers to appserver_dragonly role for testing - T286054
[10:53:58] ahh cool
[10:54:05] ok, so a remnant of the test
[10:54:10] let me fix it quickly
[10:55:15] akosiaris: i have sent a CR to manage ip_forwarding (https://gerrit.wikimedia.org/r/c/operations/puppet/+/715217), not planning to deploy today of course
[10:57:26] I've run 'ip ro ls dev docker0 && sysctl net.ipv4.ip_forward=0' on all mw hosts. That fixes the ipv4.*.forwarding stuff as well
[10:57:36] thanks
[10:58:11] jbond: so for some of these roles, it's the components that actually manage that
[10:58:34] IIRC both docker AND kubelet/kube-proxy will try to set ip_forward
[10:59:27] pkg/proxy/ipvs/proxier.go: // Set the ip_forward sysctl we need for
[10:59:36] as far as I can see only the openstack roles manage that; yes, the daemons themselves will switch it on when starting, but nothing in puppet explicitly sets it. Perhaps we don't need to (this won't solve the issue I was originally trying to solve anyway)
[11:00:18] or we could disable it by default with a low priority and let roles that need it turn it on in the appropriate profile, e.g. docker::engine
[11:00:25] my point is that if puppet tries to manage it and the daemon also tries to manage it, now you've got 2 sources of truth. as long as they agree, fine, but once someone merges a change that makes them disagree ...
[11:04:12] the puppet module only manages the sysctl.conf files, so it only affects the on-boot value, not the runtime value which the components change. However, we can also continue to leave it unmanaged
[11:05:25] after what alex said, maybe unmanaged is a bit cleaner
[11:28:18] sorry forgot to respond, will abandon
[11:42:43] oh, shit. Sorry if that caused trouble :-o
[12:24:34] 10serviceops, 10Observability-Logging, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 (10JMeybohm) a:03JMeybohm I've patched the mmkubernetes module to actually be suspende...
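A compact sketch of the audit-and-fix pass described above; the Cumin host selector is an assumption, while the one-liner is the command quoted at 10:57:26 (the && means the sysctl only runs where a leftover docker0 bridge actually exists):

    # count how many mw hosts still have forwarding enabled at runtime
    sudo cumin 'mw1*.eqiad.wmnet' 'sysctl -n net.ipv4.ip_forward'

    # on hosts with a leftover docker0 bridge, show its routes and switch
    # forwarding back off; sysctl.conf is untouched, so a reboot clears it too
    sudo cumin 'mw1*.eqiad.wmnet' 'ip ro ls dev docker0 && sysctl net.ipv4.ip_forward=0'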
[12:51:32] 10serviceops, 10SRE Observability (FY2021/2022-Q1), 10User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (10fgiunchedi) Ditto for `mmkubernetes`, code patch at https://phabricator.wikimedia.org/P17094, also pending package + deployment
[13:08:16] 10serviceops, 10Patch-For-Review: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 (10Dzahn) I rsynced /srv/org/wikimedia/racktables over from eqiad to codfw and then ran puppet. This means the formerly missing file is there now but puppet adjusted the mysql host to the codfw mast...
[13:16:25] 10serviceops, 10SRE, 10Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10fgiunchedi)
[13:17:31] 10serviceops, 10SRE, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10fgiunchedi)
[13:20:45] akosiaris: sorry to bother you again about ORES, last question - any pointers to where the score cache is configured, by any chance? (trying to write down some documentation)
[13:20:59] IIUC it is stored in another Redis rdb node
[13:26:28] elukey: it's the same hiera key, profile::ores::web::redis_host. It is being passed down via some not-so-well-named puppet classes/profiles to ores::web, populating /etc/ores/main-99.yaml IIRC, which is shared between celery and uwsgi
[13:26:56] the nitty-gritty parts are in the ores::web puppet class
[13:27:01] akosiaris: ah so it is an instance on the same node
[13:27:08] yes yes now I see it, different ports
[13:27:18] 6379 and 6380
[13:27:23] perfect thanks :)
[13:28:05] 10serviceops, 10Patch-For-Review: install racktables on miscweb2002 - https://phabricator.wikimedia.org/T269746 (10Dzahn) @Kormat @marostegui Per Jaime's comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/715233 I am pinging you guys on the ticket to let you know about changed "misc db usage/expe...
[13:31:48] I expanded https://wikitech.wikimedia.org/wiki/ORES/Deployment#Restarting_Redis a bit with what we have done today
[13:32:32] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7311810, @Jelto wrote: > There is a ClusterRole named `deploy` already for the aggregation of `view` and `pods/portForward` permissions. So I would prefer using...
[13:33:24] kudos to the team for https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details, really nice
[13:41:17] <3 now that I look at it: fixing the link to the Container Details dashboard ;-)
[14:07:08] I am trying to figure out how to pull prometheus metrics from the istio pods (I can see with nsenter a nice set of metrics if I curl localhost:15000/stats/prometheus - all envoy metrics etc..)
[14:07:19] is there any pointer/docs that I can read?
[14:08:11] but I see https://phabricator.wikimedia.org/T287007#7224824
[14:09:34] https://thanos.wikimedia.org/graph?g0.expr=istio_requests_total&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[14:09:38] wow!
[14:15:46] it will not be like this for knative
[14:15:52] but it is a good start :D
[14:26:35] akosiaris: all this fallout from T289737 - I was just about to get back to it from the rsyslog journey :D
[14:31:00] jayme: it's a bit weird. I did see that, but the actual error is a bit more mystifying
[14:31:19] 4s Warning Failed pod/mediawiki-bruce-f78c5cd48-r97lx Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-25-145508-publish": rpc error: code = Unknown desc = context canceled
[14:31:51] indeed... I wanted to go back in time to check but failed because of missing logs in logstash. So I got distracted :)
[14:31:51] I've seen context cancelled before because of memory issues, but never on the image pulling path
[14:32:08] That is potentially a timeout because of extraction time
[14:32:37] the context with dockerd is canceled in that case (because it takes >3min or so to extract the image)
[14:32:38] yeah, that is way more plausible
[14:33:06] staging has hdd iirc
[14:33:08] not ssds
[14:33:29] yeah, guess it's https://phabricator.wikimedia.org/T284628 then
[14:33:43] the timeout is 2min
[14:35:16] jayme: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=6&orgId=1&var-server=kubestage1002&var-datasource=thanos&var-cluster=kubernetes
[14:35:31] yeah, the disk subsystem is consistently clogged
[14:35:35] ouch
[14:35:44] so it's starved of IOPS
[14:35:53] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (10JMeybohm) Smells like T284628
[14:36:06] why is that on 1002 only?
[14:38:43] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (10akosiaris) ` 4s Warning Failed pod/mediawiki-bruce-f78c5cd...
[14:47:15] jayme: I have a nagging feeling this is related to that flink taskmanager running on kubestage1002
[14:50:25] that would be soooo nice :-|
[14:51:46] akosiaris: where is that assumption from?
[14:52:08] a java process consuming like 25% IOPS on kubestage1002
[14:52:14] the rest being md1_raid1
[14:52:25] iotop -d 5 on the host
[14:53:35] hmm. I've not seen a java one there
[14:53:55] yeah, I just destroyed the release there to check
[14:54:02] ah, eheh
[14:54:08] one of the good things about this being staging
[14:57:48] sigh, docker won't even reply to docker info commands
[14:58:14] dm-6 has 100% util all the time
[14:58:52] and dm-0, dm-1, dm-2, but I think those are the docker metadata and pool
[14:59:00] dm-6 must be a container, though
[14:59:27] docker-9:0-1050147-5a247b50465d4a4cd3c1096f58691d53acb8c286b3ff9a7f8168a92447e22566
[15:00:39] that is not even mounted, so I guess it's still cleaning up maybe?
[15:02:10] could be. Let me try the hammer approach. I'll stop docker and restart it
[15:02:50] can't say it's cooperating much
[15:03:48] the lvm data and metadata daemons are in D state occasionally, md1_raid1 is in D state too at times
[15:03:56] and of course dm-6, which I am still not clear what it is
[15:04:10] Job for docker.socket canceled.
[15:04:11] Job for docker.service canceled.
[15:04:13] nice...
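For reference, a sketch of how a busy dm-N device can be traced back to a docker devicemapper volume, along the lines of the digging that follows below; the device and directory names are the ones from this session, and the mount step assumes the thin volume still exists:

    # find the device-mapper target eating the IOPS (here dm-6 sat at ~100% util)
    iostat -dxm 5 2

    # map dm-6 back to its devicemapper name; docker thin volumes show up as
    # docker-<maj:min>-<inode>-<layer-id>
    ls -l /dev/mapper | grep -w dm-6

    # mount the volume read-only and look at what docker stored in it:
    # an "id" file with the layer/container id, plus the container's rootfs
    mkdir -p /tmp/dm-6 && mount -o ro /dev/dm-6 /tmp/dm-6
    cat /tmp/dm-6/id
    ls /tmp/dm-6/rootfs/srv
    umount /tmp/dm-6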
[15:04:42] I don't know if it's possible to figure out what dm-6 was with the container already destroyed :/
[15:04:58] we could try mounting and looking inside :)
[15:10:57] so IOPS-wise it's ok now
[15:11:13] but I still see 0s Warning Failed pod/mediawiki-bruce-565648dddd-bfdhk Error: context deadline exceeded
[15:11:55] # grep ^ /tmp/dm-6/id
[15:11:56] c18863344fb3225e51d7a5d891c69e6ea4aef67ecc491da3fcbde9f2d5b78d1f
[15:12:39] /tmp/dm-6/rootfs/srv/mediawiki
[15:12:45] not flink :)
[15:13:07] maybe the new bruce trying to start
[15:14:31] sigh, the size of those images...
[15:15:33] yeah... it's not going to get better
[15:15:43] anyway, not urgent, but it has all the telltale signs of https://phabricator.wikimedia.org/T284628
[15:15:59] indeed...
[15:16:04] the ssd taint trick was nice for production but it won't work here
[15:16:05] I gtg. Have a nice weekend o/
[15:16:17] same here, this can wait until monday
[15:16:22] see ya
[15:16:25] eheh, yes. unfortunately those nodes will not grow SSDs :D
[15:17:04] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Release-Engineering-Team, 10Kubernetes: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (10akosiaris) I am gonna merge this into T284628 and work on it next week.
[15:17:20] 10serviceops, 10MW-on-K8s, 10Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10akosiaris)
[15:17:36] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Release-Engineering-Team, 10Kubernetes: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (10akosiaris)
[15:25:59] 10serviceops, 10MW-on-K8s, 10Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10dancy) Looking forward to the fix!
[15:37:44] effie: should I just move ahead with setting up the Toolhub pods with an mcrouter sidecar that works like the one for mw-on-k8s?
[15:38:24] I have had the chance to review the patch
[15:38:53] given we do not have any better memcached solution to offer you, that is how it should be
[15:39:21] do you need cross-dc key replication?
[15:40:11] today, no. This is going to launch as active/passive because of database things not being in a place to allow multi-master.
[15:40:39] It won't hurt anything to have them replicated, but it is also not necessary
[15:42:26] ok, we will have to come up with a simple and sensible mcrouter config for the service
[15:48:47] * bd808 tries to grok charts/mediawiki/templates/mcrouter/_config.json.tpl
[15:50:28] 10serviceops, 10MW-on-K8s, 10Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10dancy)
[20:11:49] 10serviceops, 10Anti-Harassment, 10IP Info, 10SRE: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Niharika) >>! In T288844#7296789, @Huji wrote: > My understanding is that the changes in the data are minimal from one version to the n...
[20:13:08] 10serviceops, 10SRE Observability (FY2021/2022-Q1), 10User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (10colewhite) >>! In T210137#7314353, @fgiunchedi wrote: > Ok I have a working patch to parse `omfwd` messages at https://phabricator.wikimedia.org/P17091, p...
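As a starting point for the "simple and sensible mcrouter config" mentioned above, a minimal single-pool sketch with no cross-DC replication (which Toolhub does not need while it runs active/passive); the memcached addresses and listen port are placeholders, and this is not taken from charts/mediawiki/templates/mcrouter/_config.json.tpl:

    # write a minimal config for the sidecar; server IPs and ports are placeholders
    cat > /etc/mcrouter/config.json <<'EOF'
    {
      "pools": {
        "toolhub": { "servers": ["10.64.0.1:11211", "10.64.16.1:11211"] }
      },
      "route": "PoolRoute|toolhub"
    }
    EOF
    # the sidecar listens locally and fans requests out to the pool
    mcrouter --config file:/etc/mcrouter/config.json -p 11213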
[22:20:27] 10serviceops, 10SRE, 10Patch-For-Review: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10Legoktm) Just some unsorted thoughts: * Can we set the timeout to 120s (the MW request timeout) to see how long the request is actually taking, and whether cold caches is a reasonable thing to blam...
[23:08:36] 10serviceops, 10SRE, 10conftool, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Legoktm) 05Open→03Resolved
[23:09:02] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools, and 2 others: Clean up cron-specific elements of switchdc cookbooks - https://phabricator.wikimedia.org/T289078 (10Legoktm) 05Open→03Resolved I think this is all done now, woot!
[23:09:38] 10serviceops, 10SRE, 10Traffic, 10Datacenter-Switchover: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (10Legoktm)
[23:11:19] 10serviceops, 10SRE, 10Datacenter-Switchover: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Legoktm) @fgiunchedi do you have any pointers on what switching to encrypted rsync entails? Is it just a puppet setting somewhere?
[23:15:23] 10serviceops: Release a 1.16 tag of docker-registry.wikimedia.org/golang - https://phabricator.wikimedia.org/T283425 (10Legoktm) golang1.16 only exists in Debian bookworm so far (https://tracker.debian.org/pkg/golang-1.16), so it would need to be backported to bullseye. Or we use an upstream source instead of De...