[01:43:03] Traffic, Continuous-Integration-Infrastructure, SRE: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (Legoktm)
[04:14:02] Traffic, Infrastructure-Foundations, SRE: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (JameelKaisar) For now we are considering only the 'request_time_ms'. We are taking request time for all the probes/pulses and g...
[04:20:14] Traffic, Infrastructure-Foundations, SRE: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (JameelKaisar) **Probenet Results:** - Belarus (BY) {F37104295} - Czechia (CZ) {F37104297} - Kazakhstan (KZ) {F37104299}...
[07:27:48] netops, Infrastructure-Foundations, SRE, Patch-For-Review: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (ayounsi)
[07:28:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (ayounsi)
[07:34:01] hey folks, ready to move all vk instances in ulsfo to pki if you are ok
[07:34:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/929963
[07:46:04] cc: btullis: --^
[09:01:43] proceeding with the rollout
[09:04:10] cp4038 worked nicely, also verified via kafkacat
[09:06:16] elukey: nice :)
[09:08:36] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmne...
[09:12:48] really happy for the outcome, in theory after this we should be able to use only PKI certs for all kafka clients and clusters
[09:14:30] great
[09:26:41] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero)
[09:39:11] aaand done, all ulsfo vk instances upgraded
[09:39:41] 👍
[09:54:46] (SystemdUnitFailed) firing: haproxy_stek_job.service Failed on cp5019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:42] (SystemdUnitFailed) resolved: haproxy_stek_job.service Failed on cp5019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:13] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet wi...
[11:41:59] vgutierrez: if you have some free time today could you give https://gerrit.wikimedia.org/r/c/operations/puppet/+/929674/ a look please? The service is pretty low-traffic so this is relatively low risk
[11:42:18] (this is similar to but unrelated to the device-analytics thing I was previously asking you about :D)
[11:53:00] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmne...
[12:18:48] hnowlan: sure, do we have a beta cluster version of this?
[12:33:53] XioNoX: did we have any kind of work on row A4 management network in eqiad?
[12:34:07] a quick search on phabricator doesn't show anything
[12:34:21] and we lost mgmt in two servers there (cp1075 and cp1076)
[12:35:09] hmm 25 servers impacted
[12:35:13] so yeah, not just the cp ones
[12:35:59] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Jclark-ctr) @ayounsi removed 8 cables. deleted from netbox
[12:36:49] T339168
[12:36:50] T339168: ManagementSSHDown - https://phabricator.wikimedia.org/T339168
[12:41:06] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi)
[12:41:12] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Open→Resolved Awesome, thanks!
[12:41:42] vgutierrez: 302 dcops and jclack who is onsite
[13:16:47] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/dns/+/930293 and see also https://people.wikimedia.org/~jameel/ProbeAnalysis/files/plots/plots1/baw/
[13:17:09] going through the CR right now :)
[13:17:56] vgutierrez: I can see about spinning one up
[13:20:16] cdanis: Afghanistan seems like a good one too to re-map
[13:20:33] XioNoX: yeah, we didn't want to do everything at once, also it's still a small sample size there
[13:20:37] only 8 measurements for each dc
[13:20:43] makes sense
[13:21:17] https://people.wikimedia.org/~jameel/ProbeAnalysis/files/plots/plots1/baw/NotAvailable%20(NotAvailable).png :)
[13:21:24] ahha yes
[13:21:26] there's also one for None
[13:22:02] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) I will be working with @Clement_Goubert today at 10am CT to relocate those mw nodes.
[13:22:21] cdanis: BTW.. some plots swap drmrs and esams.. tricky for reviewing :)
[13:22:41] vgutierrez: they're in the order they are currently in geo-maps
[13:22:52] never taking sides: https://people.wikimedia.org/~jameel/ProbeAnalysis/files/plots/plots1/baw/Switzerland%20(CH).png :)
[13:22:53] so if an individual plot doesn't look something like a descending staircase, something is off
[13:23:00] XioNoX: ahahaha
[13:23:08] oh gotcha
[13:24:46] they're also color-coded by order, maybe we should change that
[13:39:48] nice work jameel and cdanis!
[13:41:00] Traffic, Data-Engineering, Data-Platform-SRE, SRE: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) Next steps: * Roll out the changes to eqsin, and monitor. * Roll out the changes to codfw, and monitor. * Roll out the changes to eqiad, and monitor. * Roll out the ch...
[14:52:49] Traffic, Continuous-Integration-Infrastructure, SRE: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar)
[14:56:45] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Jhancock.wm)
[14:57:20] Traffic, Continuous-Integration-Infrastructure, SRE: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) a:hashar That is a recurring issue cause the Jenkins jobs are running on static hosts which are not always entirely cleared up after a...
[14:58:36] Traffic, Continuous-Integration-Infrastructure, SRE: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar)
[15:01:48] Traffic, Continuous-Integration-Infrastructure, SRE: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) > The apt cache overflowing, I don't think it is garbage collected `/srv` is 21G on the instances and: | Disk size in MB | Directory |--|...
[15:04:10] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero)
[15:09:45] Traffic, Continuous-Integration-Infrastructure, SRE, ci-test-error: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (hashar) Open→Resolved I have manually deleted the apt caches which were taking half of the disk space and are never purg...
[15:29:13] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[15:37:26] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[15:38:30] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:39:06] netops, Infrastructure-Foundations, SRE, ops-codfw: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) Open→Resolved This is complete, thanks to @ssingh and @Clement_Goubert
[15:54:25] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:57:15] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[16:10:42] (SystemdUnitFailed) firing: acme-chief.service Failed on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:49] Traffic, Patch-For-Review: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (ssingh)
[16:30:51] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[16:44:35] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[16:44:45] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) Open→In progress Note: I started to bootstrap the node with instructions from https://wikitech.wikimedia.org/wiki/P...
[16:48:56] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) Also, `designate-producer` is complaining about something related to rabbitmq, possibly related to the new IP address: `...
[16:52:18] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet wi...
[18:56:07] netops, Infrastructure-Foundations: IC-307235 down yet again - https://phabricator.wikimedia.org/T339289 (CDanis)
[19:19:23] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Jhancock.wm)
[19:58:48] Traffic, SRE, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (ZauberViolino) Is the Wikipedia app available on Apple's App Store? (My iPad region is US so I cannot check...
[23:37:02] netops, Infrastructure-Foundations, ops-codfw: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul)
[23:37:24] netops, Infrastructure-Foundations, ops-codfw: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul) p:Triage→Medium
[23:52:29] netops, Infrastructure-Foundations, ops-codfw: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul)