[03:03:48] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:47] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:56] Mail, Infrastructure-Foundations: Add Auto-Submitted: auto-generated header to emails sent by scripts - https://phabricator.wikimedia.org/T347835 (ayounsi) This seems to be working well, the header is present, no related auto-reply so far.
[06:48:47] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:33] Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi)
[07:52:14] Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi) Seems to be caused by https://gerrit.wikimedia.org/...
[08:08:47] (SystemdUnitFailed) firing: (4) uwsgi-puppetdb-microservice.service Failed on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:20] morning
[08:53:46] seems that there could be an issue w/ PCC/ PuppetDB:
[08:53:48] https://puppet-compiler.wmflabs.org/output/963004/43823/
[08:54:24] this is the second change today that gives me this error (both cases same hosts, same error)
[08:54:46] the error is
[08:54:49] https://www.irccloud.com/pastebin/xtlvmTBT/
[08:55:53] it doesn't seem the usual lack of facts for a given hosts but more puppetdb not working, jbond ^^^
[08:56:07] I'm about to jump in a meeting, so can't dig right now, sorry
[08:57:09] fyi folks I changed my leave from tomorrow to today so will be offline rest of the day, if there is an emergency reach out of course
[08:57:55] fabfur: i have seen that error a couple of times its normally transient. whats the change?
[08:58:20] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[08:59:09] topranks: ack, enjoy
[08:59:38] fabfur: jbond: that's presumably T347934
[08:59:39] T347934: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934
[09:04:42] jbond: the change was https://gerrit.wikimedia.org/r/c/operations/puppet/+/963004/
[09:14:12] Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi) Open→Resolved a:taavi
[09:15:38] fyi taavi fabfur i think this is unrelated to the wmf-certs issue. the pcc puppetdb has been running for some weeks
[09:17:02] fabfur: there is no cp4040.ulsfo.wmnet host?
[09:17:29] yes, there is (I can swear it! :D )
[09:18:17] oh wait its the bastian host that changed :)
[09:18:34] * jbond shuld fully read error messages
[09:19:04] * fabfur too
[09:39:52] is there something I should do to fix this?
[09:41:03] sorry, don't really know how pcc works, you can also say to sacrifice a black rooster and I'll believe...
[09:44:52] fabfur: no its allright im looking at and reimporting the puppet facts
[09:59:55] fabfur: https://puppet-compiler.wmflabs.org/output/963004/43830/
[10:10:32] 👍
[11:08:52] (SystemdUnitFailed) firing: (4) uwsgi-puppetdb-microservice.service Failed on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:47] (SystemdUnitFailed) resolved: (2) uwsgi-puppetdb-microservice.service Failed on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:18] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[12:03:12] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (ayounsi)
[12:05:42] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) Asked Juniper about their timeline on getting this setup.
[12:06:27] netops, Data-Engineering, Infrastructure-Foundations, SRE, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (JAllemandou)
[12:22:13] (DiskSpace) firing: Disk space idp2002:9100:/ 5.985% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:35:20] netbox, netops, DC-Ops, Infrastructure-Foundations, SRE: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (ayounsi) I think here the only/best option is to reduce the time delta between when a server is connected and when switch port is configured (line `Run the sr...
[12:56:59] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi)
[13:05:13] netops, DC-Ops, Infrastructure-Foundations, Traffic, ops-esams: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (ayounsi) Open→Resolved a:RobH I believe this is all done.
[14:17:13] (DiskSpace) resolved: Disk space idp2002:9100:/ 5.73% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:17:43] (DiskSpace) firing: Disk space idp2002:9100:/ 5.961% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:07:05] SRE-tools, netops, Infrastructure-Foundations, SRE: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (ayounsi) Open→Resolved Homer 0.6.4 released.
[18:16:54] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh)
[18:17:07] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh)
[18:17:33] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh) p:Triage→Medium
[18:17:43] (DiskSpace) firing: Disk space idp2002:9100:/ 5.497% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:31:56] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ssingh) We can and probably should have a backup static routes for each of `ns[01]` but it can be to a single host instead of al...
[19:52:31] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (BBlack) Looks about right to me!
[21:49:18] netbox, netops, DC-Ops, Infrastructure-Foundations, SRE: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (wiki_willy) ++ @Papaul , who's going to dig around a bit and provide some feedback
[22:17:58] (DiskSpace) firing: Disk space idp2002:9100:/ 5.445% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace