[13:39:16] CAS-SSO, Infrastructure-Foundations, SRE: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10397071 (SLyngshede-WMF) To trigger webauthn for select users, we'll just reuse the groovy script from u2f and set the mfa-method field in LDAP to mfa-webauthn ` cas.authn.mfa...
[14:43:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[14:48:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[14:53:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:19:48] cdanis: o/ docker-pkg updated to 4.0.3, jaeger images built + published
[15:20:30] oh thank you elukey, I had it on my list for today 😅
[15:24:19] patch lgtm, just sent you one small tweak to the tests I just thought of https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/1102330
[15:25:46] +1ed!
[15:34:51] thanks again for cleaning up my obviously-wouldn't-work mess ;)
[15:35:37] I learned how to build and deploy docker-pkg! :D
[15:35:45] not that now I am happier but..
[15:35:48] :D
[15:35:51] tbh we should move it to gitlab
[15:36:05] definitely yes
[15:37:23] maybe we can even teach the trusted runners to publish to pypi ;)
[15:38:14] #future
[15:45:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:55:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:24:40] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017 (RobH) NEW
[17:25:28] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398169 (RobH) @ayounsi or @cmooney: These two switches will arrive in December. Would one of you be able to update this task with the cabling directions to...
[17:25:42] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398171 (RobH)
[17:26:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398173 (RobH)
[17:33:44] Just did some Java updates...do y'all have a script that checks if the java services have been restarted since X, or is that just a one-liner/cumin cmd?
[18:01:10] the only thing I know of in that category, inflatador, is https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/base/files/wmf-auto-restart.py and I don't see any Java support there (assuming the JRE doesn't keep all of its classpath open() forever, which I'm guessing it doesn't)
[18:01:52] (see also https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/manifests/auto_restarts/service.pp for the convenient if irrelevant-to-you puppetization)
[18:03:42] I've never worked on any of our Java services here directly though, so I don't know for sure
[18:04:19] cdanis thanks, I might concoct an evil brew of https://docs.ansible.com/ansible/latest/collections/community/general/listen_ports_facts_module.html ... I also think cadvisor can (optionally) record process info. All that to say I'm well into bad SRE overengineering territory ;P
[18:04:43] ansible?
[18:05:03] yeah, just something I can run from my laptop to see where I'm at
[18:05:37] what did you update exactly, the JRE or the application or ?
[18:07:12] the deb pkg/JRE. moritz-m actually does the hard work...he creates tickets like https://phabricator.wikimedia.org/T377938
[18:08:02] Then we restart the services to complete the update. I just had a cookbook failure halfway thru the wdqs hosts, so I wanted to quickly check where I'm at
[18:08:31] in that case I do think wmf-auto-restart will work for you -- the old JRE executable should indeed be a deleted file
[18:10:32] but I'd still need to install it and run it individually on each host? or is there a cookbook that creates a report or something?
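(Editor's sketch, not part of the log.) The "deleted file" point above relies on a Linux behavior: once a running process's binary has been replaced on disk, the /proc/<pid>/exe symlink target ends in " (deleted)". A minimal Python sketch of that check follows. It is not how wmf-auto-restart.py is implemented; the function names and the example JVM path are hypothetical.

```python
import os
from typing import Callable, Iterable, List

# Linux appends this marker to the /proc/<pid>/exe link target once the
# on-disk binary has been unlinked or replaced (e.g. by a package upgrade).
DELETED_SUFFIX = " (deleted)"


def exe_is_deleted(target: str) -> bool:
    """Return True if a /proc/<pid>/exe link target points at a deleted file."""
    return target.endswith(DELETED_SUFFIX)


def stale_pids(
    pids: Iterable[int],
    readlink: Callable[[str], str] = os.readlink,
) -> List[int]:
    """Return the PIDs whose executable has been deleted since they started.

    `readlink` is injectable so the logic can be exercised without /proc.
    PIDs that vanish or cannot be read are skipped.
    """
    stale = []
    for pid in pids:
        try:
            target = readlink(f"/proc/{pid}/exe")
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        if exe_is_deleted(target):
            stale.append(pid)
    return stale


if __name__ == "__main__":
    # Hypothetical stand-in for /proc, using made-up PIDs and a made-up JVM path.
    fake_proc = {
        "/proc/10/exe": "/usr/lib/jvm/example/bin/java (deleted)",
        "/proc/11/exe": "/usr/lib/jvm/example/bin/java",
    }
    print(stale_pids([10, 11], readlink=fake_proc.__getitem__))
```

Reading another user's /proc/<pid>/exe generally needs root, which is why the cumin one-liners in this log run under sudo.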
[18:10:48] looks like the --dry-run option would do what I want
[18:11:06] 💙cdanis@cumin1002.eqiad.wmnet ~ 🕐☕ sudo cumin 'A:wdqs-all' 'for P in $(pidof java) ; do ls -l /proc/$P/exe ; done | grep deleted'
[18:12:26] * inflatador gives it a shot
[18:13:10] I'll have to look at cumin output options...I imagine that wouldn't look great when running against ~200 elastic hosts or whatever
[18:13:18] sudo cumin -o txt 'A:wdqs-all' 'for P in $(pidof java) ; do ls -l /proc/$P/exe ; done | grep deleted >/dev/null && hostname'
[18:13:44] some more shell one-liner fun at https://wikitech.wikimedia.org/wiki/Cumin#Output_handling
[18:14:21] you could cookbook-ize this ofc, although I would also (for stateless services, anyway, where a restart is cheap) set up profile::auto_restarts::service
[18:14:51] Thanks, I'll take a look. Unfortunately we can't auto_restart most of our java stuff ;(
[18:14:56] it would also be fine and good to add a feature to the puppetization that perhaps runs in dry-run mode for a service, and exports a prometheus textfile exporter
[18:15:10] then you could put it on a dashboard
[18:21:53] that's a good idea. I was thinking about running something like `ps aux | phaste` but dashboard sounds more digestible
[19:30:23] hello I/F friends - if possible, I'd like to get input on how we should proceed in [0] (see latest comment from me for a summary of what I think needs done). it would be particularly useful if anyone happens to know the history / scope / intent of the delegation of ownership. thanks!
[19:30:23] [0] https://phabricator.wikimedia.org/T381904
[20:01:55] FIRING: MaxConntrack: Max conntrack at 81.4% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:06:55] RESOLVED: MaxConntrack: Max conntrack at 83.4% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:00:31] swfrench-wmf: thanks for the detailed write up, I'm not sure who on the team has the most knowledge on the subject, happy to bring it up at our next team meeting
[22:51:50] thanks, jhathaway!
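
(Editor's sketch, not part of the log.) The 18:14:56 suggestion was to export the dry-run result through a Prometheus textfile exporter so it can go on a dashboard. A minimal Python sketch of that idea follows, assuming the node_exporter textfile collector picks up `*.prom` files from a configured directory. The metric name `service_needs_restart`, the service names, and the function names are all hypothetical, and nothing here reflects the existing puppetization.

```python
import os
import tempfile
from typing import Mapping

# Hypothetical metric name, chosen for illustration only.
METRIC = "service_needs_restart"


def render_metrics(status: Mapping[str, bool]) -> str:
    """Render one gauge sample per service in the Prometheus text format.

    Assumes service names contain no quotes, backslashes, or newlines,
    so no label-value escaping is performed.
    """
    lines = [
        f"# HELP {METRIC} 1 if the service still runs a deleted executable.",
        f"# TYPE {METRIC} gauge",
    ]
    for service in sorted(status):
        lines.append(f'{METRIC}{{service="{service}"}} {int(status[service])}')
    return "\n".join(lines) + "\n"


def write_textfile(path: str, body: str) -> None:
    """Write the file atomically so the collector never reads a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.replace(tmp_path, path)  # atomic rename within the same directory


if __name__ == "__main__":
    # Hypothetical service names and output location.
    body = render_metrics({"example-java-service": True, "other-service": False})
    target_dir = tempfile.mkdtemp()
    write_textfile(os.path.join(target_dir, "needs_restart.prom"), body)
    print(body, end="")
```

The temporary file uses a `.tmp` suffix so the collector, which only reads `*.prom` files, ignores it until the rename completes.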