[01:45:13] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:15:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:37:40] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui) @Ladsgroup During this operation, replication codfw -> eqiad is still active, so as there are codfw masters involved (even if codfw will b...
[07:15:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:39:35] good morning, can I suggest to revisit the policy for the alerts in this channel? So far all those that I've seen were not actionable for me (and in multiple cases actually unrelated to our team). I ended up just ignoring them (visually, not ignoring jinxer-wm in IRC terms) and for me they just clutter the channel
[07:41:25] +1 or we need to commit to act on those alerts, it's the same few that come in regularly
[07:45:33] specifically the image build/report ones are being acted on, but it takes some time
[07:45:45] but on a general note we're not really the primary point of contact
[07:46:11] I've been fixing these up under the wider "fix the stretch removal fallout"
[07:47:11] but it's more of a general SRE thing (with ServiceOps being the primary users)
[07:47:37] it's like this alert we have about puppet manifests which change with every run
[07:47:56] it's being alerted for the puppetdb hosts, but the actionables are for pretty much every SRE team
[07:47:59] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (ayounsi) > The way to write to IRC is already present, see https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.irc_logger I'd be a b...
[07:50:28] I see, so yeah annoying position to be in
[07:51:26] I guess longer fix is to not have that specific check but instead "sub" checks that are for each relevant team?
[07:52:16] that. or some team needs to specifically own the image maintenance in general
[07:53:00] but those sub checks will also need more time/prep work, currently we e.g. also lack annotations for who's actually using/maintaining an image
[07:53:41] image maintenance in general needs more work around ownership, expected reaction times etc.
[07:54:14] maybe bold move, but should we downtime/ack that specific alert forever with the relevant task as reason?
[07:58:30] this specific alert I have ends soon, it's been a bit of a whack-a-mole, there's layers of errors and when you fix up one, it proceeds and then runs into fresh ones
[07:58:45] maybe this silences when I've merged https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/911761
[07:59:17] at least the reports for docker-reporter-k8s-images.service are already gone, it's just base left
[08:04:03] awesome, thanks!
[08:30:50] quick sanity check for https://gerrit.wikimedia.org/r/c/operations/puppet/+/912783/ anyone? (followup to yesterday's switchover)
[08:43:11] I'm getting tired of singtel... https://phabricator.wikimedia.org/T335475
[09:04:12] Do you need me to write a script to create automatic tickets in their systems when Icinga alerts :-)
[09:10:10] slyngs: yes please, not on icinga though but on this cookbook: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/network/debug.py to open a phab task and a provider ticket, next step will be to have the cookbook be triggered by icinga or librenms or alertmanager
[09:11:20] That would actually be kinda fun to do :-)
[09:11:59] be my guest, I'd be happy to help
[09:12:37] longer term it should integrate with https://phabricator.wikimedia.org/T230835 to not trigger if there is an ongoing maintenance
[09:35:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:35] netops, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (fgiunchedi)
[13:45:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:12] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (fgiunchedi) >>! In T335027#8795348, @ayounsi wrote: > Thanks for the quick reply! This now works: > ` > prometheus1006:~$ curl lsw1-e8-eqiad....
[13:57:24] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) Nice! the first task got created and resolved! {T335403} Would it be possible to improve the description: > descr...
[14:03:01] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (fgiunchedi) >>! In T333007#8804137, @jbond wrote: >>>! In T333007#8804112, @jbond wrote: >>>>! In T333007#8803629, @ayounsi...
[15:30:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:13] (SystemdUnitFailed) firing: (8) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:20] SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti: consider --no-wait-for-sync as a default option for instance creation - https://phabricator.wikimedia.org/T335522 (herron) p:Triage→Medium
[16:35:13] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:17] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:35:13] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:43:04] FYI Netbox 3.5.0 is out: https://github.com/netbox-community/netbox/releases/tag/v3.5.0
[18:47:10] it's becoming urgent to upgrade
[18:48:53] * volans hides
[19:17:03] SRE-tools, Infrastructure-Foundations, Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (BCornwall) @ayounsi Thanks for the report! I have a naive question: Would it be possible/more correct to interface confctl/etcd rather than a cookbook? That (by my observa...
[19:35:13] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:41:39] SRE-tools, Infrastructure-Foundations, Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (BBlack) I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the p...
[23:08:09] netops, Infrastructure-Foundations, SRE, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (colewhite)
[23:35:13] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed