[07:08:43] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725186 (LSobanski) Additional BGP WARNING alert that showed up today: ` AS38082/IPv6: Active (for 65d14h), AS5398/IPv6: Active (for 118d16h), A...
[07:51:58] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725253 (cmooney) I'll take a look and clear up what I can.
[08:00:10] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9725264 (cmooney) Open→Resolved This one in particular down for almost a year and IPs are not responding to ARP/ND on the LAN. Peerin...
[08:12:27] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9725288 (Fabfur) I'd like to join the chorus of thanks to Papaul, you resolved us a very nasty and long running issue here! Thanks again!
[09:02:18] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9725479 (cmooney) >>! In T350179#9725288, @Fabfur wrote: > I'd like to join the chorus of thanks to Papaul, you resolved us a very nasty an...
[09:07:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9725493 (jcrespo) Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), it is [[ https://aler...
[09:14:14] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9725545 (cmooney) >>! In T362421#9725493, @jcrespo wrote: > Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a815...
[10:53:43] Hello. I'd like to merge this small change to the trafficserver config today, if possible: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020798
[10:54:47] Are you happy for me to run `cumin A:cp run-puppet-agent` to apply it synchronously, or would you prefer it to be staggered?
[10:55:28] or perhaps `cumin O:cache::text run-puppet-agent` is better.
[10:55:47] hi, checking
[10:56:24] thx
[10:58:02] I don't think it's too "dangerous", you could apply manually to a host and target it just to check that everything works as expected and then let puppet converge (if there's no problem in some host behaving differently from others for a small time)
[11:03:49] Thanks. Ideally, it would be a big bang migration with as short as possible a window of downtime, since this is a host (with its own database) that just runs on a single VM. I have to dump, transfer, restore the database so I was hoping to apply the puppet change fairly synchronously on the cp hosts, but only if you're happy with that.
[11:04:46] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9725885 (gmodena) >>! In T351117#9688466, @gmodena wrote: > Next steps: now that we are starting to collect more logs, we c...
[11:19:37] A:cp-text matches 48 hosts, maybe put a -b 30 to be on the safe side
[11:21:13] +1 to both trying on one host and spreading it out a bit
[11:21:43] (in case you go the way of forcing puppet)
[11:36:26] Traffic, SRE, Wikimedia Enterprise: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628#9725986 (Ottomata) Resolved→Declined Hello! I don't think this task is resolved. Perhaps you meant to decline it? Being bold and d...
[11:36:58] OK, thanks all.
[12:34:47] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9726133 (Fabfur) I agree with @gmodena on all topics, more specifically: * About the `sequence` issue, that's the most pla...
[12:41:48] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9726146 (Ottomata) > We could append (or prepend) other information pieces to the sequence number (like the haproxy process...
[13:06:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9726265 (ssingh) Thanks @jcrespo! I should have silenced the alert or restarted the service; both of those are in progress now so we should see this resolve soon. @cmooney:...
[13:12:25] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9726287 (gmodena) > About the sequence issue, that's the most plausible hypotheses. We could append (or prepend) other info...
[13:47:35] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9726516 (Fabfur) >>! In T351117#9726287, @gmodena wrote: >> About the sequence issue, that's the most plausible hypotheses....
[13:59:16] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9726548 (JAllemandou) I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values...
[14:19:41] godog: I'm working on replacing mtail with benthos on the ncredir cluster, I'm having a small issue with the current benthos puppetization (see https://puppet-compiler.wmflabs.org/output/1021485/1997/ for ncredir2002)
[14:20:11] godog: basically the current benthos puppetization assumes that we are using benthos to send data to a kafka cluster, but on ncredir I only need it to produce prometheus metrics
[14:20:32] could we make the kafka stuff on profile::benthos optional?
[14:34:25] (SystemdUnitFailed) firing: user@0.service on cp6004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:14] ^ restarted
[14:39:28] vgutierrez: nice re: replacing mtail, yes definitely on generalizing the benthos puppetization and making kafka optional
[14:44:25] (SystemdUnitFailed) resolved: user@0.service on cp6004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:45:04] godog: ack, I'll give it a go
[14:45:34] vgutierrez: sweet, thank you! happy to review patches / assist
[15:29:30] netops, Infrastructure-Foundations, SRE: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902 (Fabfur) NEW
[15:30:22] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021502
[16:03:27] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9727085 (xcollazo) >>! In T351117#9726548, @JAllemandou wrote: > I think @Ottomata 's idea is good: having another column m...
[16:26:42] vgutierrez: nice, took only a quick look and it looks sane/straightforward, I'll take a closer look next week
[16:50:28] Thanks godog
[17:04:48] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9727407 (ovasileva) p:Triage→High
[18:03:35] sukhe: fabfur: dumb question, what's the expected timeline on having magru ready to serve?
[18:04:27] cdanis: not dumb at all :) I think we should be up and running by Friday of next week (April 26) if all goes well and network equipment is delivered on time by Monday
[18:04:39] sweet
[18:05:25] in that case the easiest thing to do will be to just submit the patch against mediawiki/extensions/WikimediaEvents on Friday, and let it ride the branch cut Monday / train deploy that week
[18:05:30] when we turn it on -- we will see but I think unlikely on Friday? so that would mean Monday April 29. there was some discussion on maybe doing a Friday turn-up given weekend traffic tends to be on the lower side
[18:05:46] that would be possible too sure
[18:06:14] I'm just thinking of the patch for Probenet, we can let it ride the Mediawiki train, or we can do a backport deploy (which is easier now than it used to be, thanks `scap backport`)
[18:06:36] happy to help with that part once we're there
[18:07:10] and in the meanwhile, I'll make sure I can still find and run the data analysis scripts 😅
[18:07:28] whatever you think is safest: technically the site will be up and running on Friday (I think) so if you are not worried about the user getting connection errors against, we can do the Friday patch and let it go with the train on Monday
[18:08:06] and yeah thanks for the offer of course
[18:09:34] I guess put differently, does it matter if we merge the patch at the same time we turn on the site, or we merge it in advance, or we merge it later (after the site has been up and running and serving real traffic)
[18:10:25] yeah sorry, I was being unclear -- I'd rather not merge the patch when the measurement endpoint DNS resolves but doesn't respond, because that has the worst implications on user-agent performance and energy consumption
[18:11:36] I think it's fine to merge the patch while any of the following are true: a) the measurement endpoint gives NXDOMAIN, b) the IP returns connection refused, or c) the IP answers, serves TLS, and returns a 200 OK
[18:12:26] I'm probably worrying over nothing, as it will only trigger a probe a very small amount of the time and only at most once per user, but still, we've never actually run it in that state
[18:12:42] yeah also we can control when to do it in this case, so in that sense, why not
[18:18:34] I guess our perspective is that there are a lot of eyes on this data and we have control on its rollout, so whatever works best for you, we will try to do that!
[18:22:14] Domains, SRE: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921 (EdErhart-WMF) NEW
[18:36:51] I agree on merging & turning it on on Monday (if all goes fine and Friday the DC is fully operating)
[19:11:08] Traffic, DNS, SRE: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9727972 (ssingh) Discussed a bit with @EdErhart-WMF on what the goal is here on Slack and will update this task later when there is more clarity.
[23:45:11] Traffic, ops-esams, SRE: cp3079 bios settings - https://phabricator.wikimedia.org/T349314#9728431 (Dzahn)
[23:46:24] Traffic, DC-Ops, ops-esams, SRE: cp3079 bios settings - https://phabricator.wikimedia.org/T349314#9728432 (Dzahn)
[23:49:32] Traffic, SRE: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743#9728435 (Dzahn) p:Triage→Low latest comment on T345809 and the merge from October 2023 sound like this is basically declined?