[03:33:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:33:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:18] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm
[11:07:26] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm completed: - testvm2...
[11:11:35] jbond, moritzm: FYI I'll make a spicerack release after lunch
[11:20:07] ack cheers
[11:20:20] great, thx
[11:39:42] (SystemdUnitFailed) firing: (2) uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:27] jbond: yeah that's great! even standardizing it across SRE would be awesome
[11:41:37] if it was me I'd replace "nit" by "advice" or similar, and "nit, opt" with "nit", as having the two makes it confusing
[11:42:19] XioNoX: yes across sre would be great, also happy to make the change suggested thanks
[11:43:02] the puppetboard1003 alerts, is one of you working on them?
[11:43:11] nit: make the SLA depend on the patch length/complexity :)
[11:43:31] volans: ahh yes sorry i can silence puppetboard1003 alerts
[11:44:00] XioNoX: yes i think if it was going to be a more formal team/department thing then the sla text would need a bit more work
[11:44:56] the problem is that it's hard to use the same meaning for all repos IMHO
[11:45:07] yeah, the SLA can be a long discussion
[11:47:01] also afaik, the current, mostly informal, way of doing code reviews works fine
[11:47:09] unless there are issues I'm not aware of
[11:49:33] the main reason i added this is that i have had people in private ask me what it means when i comment nit: but give a +1 and also just what +1 vs 0 vs -1 means.
As such i just wanted to get something to point people to
[11:50:40] i think the current way of code reviews works but id also say it could use some improvements, especially when interacting with other sub teams or departments, however i suspect that a lot of this could be a misunderstanding or difference of expectations which the team API will hopefully help with
[11:51:12] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) @jbond I tested https://gerrit.wikimedia.org/r/916509 on the GitLab hosts but the change is noop and no new oauth provider is av...
[11:51:19] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1001 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**PASS**)...
[11:51:31] noted, so yeah making an I/F standard is the best next step
[11:52:07] do we know if other (sub-)teams already have something like that in place?
[11:52:45] i have not seen anything, there is some text on mediawiki but they basically use +2 in the same way we mostly use +1 from what i can tell
[11:53:31] https://www.mediawiki.org/wiki/Special:MyLanguage/Gerrit/Code_review#Complete_the_review
[11:53:50] others may know of something and im sure there is some history of this all failing in the past :)
[11:54:27] yeah and see today's thread about accidental +2
[11:54:52] yes exactly, that policy is not too compatible with gate and submit
[11:55:08] (where the original author probably wants to deploy)
[11:55:19] slyngs: the test-cookbook script is available on the cumin hosts if you want to be one of the first beta-testers
[11:55:50] volans: I just finished testing :-)
[11:56:10] jelto: you were interested too ^^^
[11:56:14] :)
[11:56:36] no worries slyngs
[12:03:48] thanks I'll try that soon when I have a new change 🥳
[12:13:31] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @papaul when you are back can you advise on the status of these? They all appear as connected on asw-b1-codfw...
[12:29:42] (SystemdUnitFailed) firing: (6) cadvisor.service Failed on ganeti2019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:42] (SystemdUnitFailed) firing: (9) cadvisor.service Failed on ganeti1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:42] (SystemdUnitFailed) firing: (9) cadvisor.service Failed on ganeti1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:42] (SystemdUnitFailed) firing: (9) cadvisor.service Failed on ganeti1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:16] godog: ^
[13:06:38] XioNoX: yeah, puppet runs in progress
[13:09:42] (SystemdUnitFailed) firing: (10) cadvisor.service Failed on ganeti1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:42] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:42] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:42] (SystemdUnitFailed) resolved: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:51] I think we have some regression with the decom cookbook:
[14:20:19] moritzm: what's up?
[14:20:21] 1) for the bast2002 decom I ran the cookbook (even twice since I suspected the odd race we have from time to time) and it's still in puppetdb: https://phabricator.wikimedia.org/T336995
[14:20:53] last time I checked I suspected a puppetdb issue tbh
[14:20:58] 2) the decom of labstore1004/1005 also went amiss, I can still see both hosts in puppetdb (I noticed it when deploying the libwebp update since they were failing)
[14:21:14] https://phabricator.wikimedia.org/T337269
[14:21:39] ah, that task even mentions a traceback: https://phabricator.wikimedia.org/T337269#8872155
[14:26:56] moritzm: did you get the same stacktrace?
[14:29:31] no, I checked and for the bast2002 decom there's no apparent traceback/error logged
[14:29:55] ok
[14:30:08] bast2002 is soonish garbage collected from puppetdb anyway
[14:31:16] when was it decomm'ed?
[14:32:41] first attempt was on May 19, then a second time on May 22, and then on May 24 DC ops unracked the hw
[14:33:39] 2023-05-22T06:40:56.845Z INFO [p.p.command] [8560884-1684737656824] [18 ms] 'deactivate node' command processed for bast2002
[14:34:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:56] the previous one was at
[14:34:57] puppetdb-2023-05-19.0.log.gz:2023-05-19T10:41:41.566Z INFO [p.p.command] [8056505-1684492901551] [13 ms] 'deactivate node' command processed for bast2002
[14:36:40] hmmh, we had the odd race where the deactivate node races with an ongoing puppet run, but twice?
[14:37:19] I'm more worried about cross-dc puppetdb things
[15:07:41] cross-dc shouldn't matter too much, both puppetdb daemons use the same primary postgres db
[15:34:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:09:13] (DiskSpace) firing: Disk space idp1002:9100:/ 5.902% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace