[00:01:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:20] off, cya [00:02:28] no alerts today [00:11:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-coord1003.mgmt.eqiad.wmnet with reboot policy FORCED [00:15:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-coord1004.mgmt.eqiad.wmnet with reboot policy FORCED [00:15:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host an-mariadb1001.mgmt.eqiad.wmnet with reboot policy FORCED [00:16:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host an-mariadb1002.mgmt.eqiad.wmnet with reboot policy FORCED [00:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:36:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-mariadb1001.mgmt.eqiad.wmnet with reboot policy FORCED [00:36:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-mariadb1002.mgmt.eqiad.wmnet with reboot policy FORCED [00:38:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Papaul) [00:40:18] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-coord1003'] [00:41:01] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-coord1004'] [00:54:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware 
for hosts ['an-coord1003'] [00:54:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['an-coord1004'] [01:02:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-coord1003'] [01:02:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['an-coord1003'] [01:03:15] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-coord1004'] [01:03:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['an-coord1004'] [01:04:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-mariadb1001'] [01:04:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-mariadb1002'] [01:05:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Papaul) [01:21:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['an-mariadb1001'] [01:21:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['an-mariadb1002'] [01:25:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-mariadb1001'] [01:25:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['an-mariadb1001'] [01:26:01] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-mariadb1002'] [01:26:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware 
for hosts ['an-mariadb1002'] [01:30:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:30:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Papaul) [01:34:23] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [01:36:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host druid1010.mgmt.eqiad.wmnet with reboot policy FORCED [01:42:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:52:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [01:53:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [01:54:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot 
policy FORCED [01:55:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [01:57:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:03:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [02:08:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1010.mgmt.eqiad.wmnet with reboot policy FORCED [02:09:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Papaul) [02:09:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host druid1011.mgmt.eqiad.wmnet with reboot policy FORCED [02:17:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:48] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Papaul) @Jclark-ctr when you are back onsite can you please audit all the interfaces below and let me know which server is connected to each interface. The switch is showing that the interfaces are up and someth... 
[02:22:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1011.mgmt.eqiad.wmnet with reboot policy FORCED [02:33:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Papaul) [02:37:39] PROBLEM - Check systemd state on apifeatureusage2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:55:03] RECOVERY - Check systemd state on apifeatureusage2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:08] (03PS1) 10Legoktm: gitlab: Allow use of any rustlang/rust image [puppet] - 10https://gerrit.wikimedia.org/r/879693 (https://phabricator.wikimedia.org/T326515) [05:04:32] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) [05:10:32] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) The fix for T320734 is now deployed, so the ESI comment string for testing is now being added to the base HTML for both desktop and mobile sites. So, once the... 
[06:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230113T0700) [07:40:32] (03PS1) 10Slyngshede: c:idm enable tls in Apache [puppet] - 10https://gerrit.wikimedia.org/r/879699 [07:47:15] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230113T0800) [08:02:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:03:55] (03PS2) 10Slyngshede: c:idm enable tls in Apache [puppet] - 10https://gerrit.wikimedia.org/r/879699 [08:04:54] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39123/console" [puppet] - 10https://gerrit.wikimedia.org/r/879699 (owner: 10Slyngshede) [08:06:54] (03CR) 10Muehlenhoff: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:10:34] (03PS1) 10Ayounsi: network:external: add wikidough v6 + descriptions [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) [08:11:42] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/879605 (owner: 10Jbond) [08:12:58] (03CR) 10JMeybohm: [C: 03+1] "I must admit that I'm not a super big fan for two reasons:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [08:13:53] (03CR) 10Ayounsi: [C: 03+1] "To clear the last "BGP alert: misconfiguration"" [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [08:14:05] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:14:27] (03PS2) 10Ayounsi: network:external: add wikidough v6 + descriptions [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) [08:15:40] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [08:18:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [08:25:21] there has been a high error rate since 7:45 [08:25:49] (or exceptions) [08:26:25] flow_revision and the external cluster, maybe? [08:26:45] on testwiki [08:28:01] also warnings on mwmaint1002 & query cache, not sure which of the 2 is the one responsible [08:28:59] the graph and kibana are showing different results, so unsure [08:29:59] ok, it is none of those, it is DBTransactionSizeError on enwiki [08:30:24] /wiki/Special:Preferences [08:31:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great, merging. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/879420 (owner: 10Majavah) [08:31:27] (03CR) 10Muehlenhoff: [C: 03+2] admin: remove duplicate users from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879420 (owner: 10Majavah) [08:32:14] "/wiki/Special:Preferences Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 5.811s in writes, exceeding the 3s limit" [08:33:03] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 133 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:35:31] (03PS1) 10Muehlenhoff: Also remove duplicates from absent_ldap [puppet] - 10https://gerrit.wikimedia.org/r/879743 [08:45:35] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:46:10] jynus: PM [08:46:57] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/879742/39124/rpki1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [08:47:54] (03CR) 10Awight: [C: 04-1] "Looks right, just needs to wait until monitoring is in place." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: 10Svantje Lilienthal) [08:48:27] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:05] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:00:06] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:03:36] (03PS3) 10Ayounsi: network:external: add wikidough v6 + descriptions [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) [09:03:38] (03PS1) 10Ayounsi: BGPalerter: monitorPathNeighbors bump threshold [puppet] - 10https://gerrit.wikimedia.org/r/879747 (https://phabricator.wikimedia.org/T230600) [09:06:41] !log installing bast3006 T324974 [09:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:45] T324974: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 [09:10:51] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:12:11] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [09:13:58] 10SRE, 10Infrastructure-Foundations: icinga raid monitoring inoperable for H750 controllers - 
https://phabricator.wikimedia.org/T315608 (10fgiunchedi) Adding back infra foundations and SRE here, though leaving out o11y since I don't think there's any actionable at this time for us [09:14:06] (03PS13) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [09:17:32] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:24:34] (03Abandoned) 10Slyngshede: c:idm enable tls in Apache [puppet] - 10https://gerrit.wikimedia.org/r/879699 (owner: 10Slyngshede) [09:24:42] (03CR) 10Slyngshede: [C: 03+2] CNAME for idm-test [dns] - 10https://gerrit.wikimedia.org/r/879522 (owner: 10Slyngshede) [09:26:15] (03CR) 10Ayounsi: "Thanks, replies inline, only one point need follow up." [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [09:27:00] (03PS4) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [09:29:17] (03PS14) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) [09:29:45] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:31:30] 10SRE, 10observability, 10User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10fgiunchedi) [09:31:37] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:32:38] (03CR) 10MVernon: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [09:32:52] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [09:33:30] 10SRE, 10serviceops, 10User-fgiunchedi: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10fgiunchedi) [09:33:40] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) Re-scheduling ulsfo for Jan 16th at 12:00 UTC [09:34:33] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.089 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:38:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson) [09:39:15] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment Add TLS termination. [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [09:40:48] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Summarizing, what's left on FPC4 in term of physical interfaces, leaving asw2-d-eqiad aside for now, as we're tackling them in T313463: cr1-eqiad: xe-... 
[09:41:08] !log installing bast4004 T324974 [09:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:12] T324974: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 [09:44:10] (03PS15) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) [09:48:45] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [09:49:50] PROBLEM - Check unit status of acme-chief #page on acmechief1001 is CRITICAL: CRITICAL: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [09:50:17] let's see [09:50:18] * Emperor is around, known problem? [09:50:35] not sure yet [09:50:36] acme-chief? [09:50:55] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [09:51:10] that's going to be a letsencrypt thing [09:51:17] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:17] systemd has given up on restarting it because it is failing fast [09:51:17] I don't see anything in the log [09:51:21] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:51:25] I restarted it [09:51:29] Jan 13 09:30:00 acmechief1001 systemd[1]: Started Sync acme-chief certificates. [09:51:29] Jan 13 09:30:00 acmechief1001 acme-chief-certs-sync[21577]: Could not create directory '/nonexistent/.ssh'. 
[09:51:29] Jan 13 09:30:01 acmechief1001 acme-chief-certs-sync[21577]: Could not chdir to home directory /nonexistent: No such file or directory [09:51:29] Jan 13 09:30:02 acmechief1001 systemd[1]: acme-chief-certs-sync.service: Succeeded. [09:51:33] those were the last logs [09:51:37] slyngs: related to your change [09:51:40] XioNoX: as in, after the p*age? [09:51:43] let's revert [09:51:43] I just ran `acmechief1001:~$ sudo service acme-chief-certs-sync status` [09:51:48] er "start" [09:51:51] and it came back [09:51:54] jynus: yeah after the page [09:51:58] XioNoX: no, it's still broken [09:52:02] Jan 13 09:47:44 acmechief1001 systemd[1]: Failed to start acme-chief Service. [09:52:09] ok, seems it is related to some ongoing work, I will ack [09:52:13] Hmm [09:52:14] that's another unit [09:52:15] ah right it stops again [09:52:24] acme-chief is the one failing [09:52:50] godog did it before me [09:52:51] afaik there is no "acme-chief" on that host, only "acme-chief-certs-sync" [09:52:59] this doesn't cause a user-visible outage yet, right? [09:53:02] What's the host? [09:53:07] acmechief1001 [09:53:13] Errr [09:53:19] Jan 13 09:47:44 acmechief1001 acme-chief-backend[24976]: challenge_type = CHALLENGE_TYPES[cert_details['challenge']] [09:53:22] Jan 13 09:47:44 acmechief1001 acme-chief-backend[24976]: KeyError: 'dns01' [09:53:38] I'd say revert for now? [09:53:41] as mentioned above it's a typo in Simon's patch [09:53:42] Hmmm [09:53:47] Yeah [09:53:48] should be dns-01 instead of dns01 [09:53:53] got it, thank you moritzm [09:53:55] Aah, I'll just fix [09:53:58] added here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/879182/3/hieradata/role/common/acme_chief.yaml [09:54:01] revert or patch on top, both would work, I guess? 
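The `KeyError: 'dns01'` in the traceback above comes from a plain dictionary lookup, `CHALLENGE_TYPES[cert_details['challenge']]`. A minimal Python sketch of that failure mode follows. It assumes the mapping is keyed by the ACME challenge names `dns-01` and `http-01`; the dictionary values and the helpers `resolve_challenge` and `try_resolve` are hypothetical and not taken from acme-chief.

```python
# Hypothetical stand-in for the mapping used by acme-chief-backend.
# Only the lookup expression itself appears in the log; the values are made up.
CHALLENGE_TYPES = {
    "dns-01": "DNS01",
    "http-01": "HTTP01",
}


def resolve_challenge(cert_details):
    """Look up the challenge type the same way the traceback line does."""
    return CHALLENGE_TYPES[cert_details["challenge"]]


def try_resolve(cert_details):
    """Return the challenge type, or a readable message instead of a KeyError."""
    try:
        return resolve_challenge(cert_details)
    except KeyError as exc:
        known = sorted(CHALLENGE_TYPES)
        return f"unknown challenge {exc.args[0]!r}, expected one of {known}"


if __name__ == "__main__":
    print(try_resolve({"challenge": "dns-01"}))
    print(try_resolve({"challenge": "dns01"}))
```

With a mapping like this, the misspelled `dns01` has no entry, so the unguarded lookup raises on startup and the unit fails immediately, which matches systemd giving up on restarts.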
[09:54:08] patch on top [09:54:13] yeah roll forward is fine [09:54:25] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:54:26] slyngs to take action, please :-D [09:54:31] Will do [09:55:18] meanwhile I will double check the usual signals to make sure there is no impact [09:55:55] (03PS1) 10Slyngshede: P:acme_chief::certificates fix spelling [puppet] - 10https://gerrit.wikimedia.org/r/879749 [09:56:06] what's the highest-traffic service we serve under Let's Encrypt? [09:56:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879749 (owner: 10Slyngshede) [09:56:33] jynus: Wikipedia [09:56:36] lol [09:56:37] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18599322656 and 58950 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:56:41] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3896430912 and 58955 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:56:45] (03CR) 10Filippo Giunchedi: [C: 03+1] P:acme_chief::certificates fix spelling [puppet] - 10https://gerrit.wikimedia.org/r/879749 (owner: 10Slyngshede) [09:56:56] jynus for the US DCs [09:57:15] I see, so a problem there would have been very visible [09:57:23] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17338947576 and 58995 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:57:23] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4681211320 and 58995 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:57:23] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 15102211864 and 58995 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:57:25] (03CR) 10Slyngshede: [C: 03+2] P:acme_chief::certificates fix spelling [puppet] - 10https://gerrit.wikimedia.org/r/879749 (owner: 10Slyngshede) [09:57:26] (a user facing one, I mean) [09:58:05] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) @Jclark-ctr could you run (and connect and add the optic on the asw side for) this fiber : https... [09:58:20] (03PS16) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) [09:58:38] jynus: mirrors? [09:58:39] to prevent that error from happening again we could add a CI check which validates that a challenge listed in profile::acme_chief::certificates matches an existing one listed in profile::acme_chief::challenges [09:59:05] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:24] jynus: I'd expect breaking acme would just stop cert renewals from happening, so it'd have to be bust for quite some time before things started expiring [09:59:27] recovery, let's see if the tool catches it [09:59:29] slyngs: I'd say that en.wikipedia.org has slightly more traffic than mirrors [09:59:37] Emperor: I would, too [09:59:44] Emperor: indeed [10:00:00] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [10:00:02] short term impact would be messing with 
OCSP responses [10:00:02] but in my current role I don't want to guess anything, I'd rather double check it [10:00:04] RECOVERY - Check unit status of acme-chief #page on acmechief1001 is OK: OK: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [10:00:14] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:00:22] there is the recovery [10:00:26] mid-term some certificates would expire [10:00:26] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 16 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:00:40] moritzm: wanna help creating a ticket for the check? [10:00:59] seems like a good idea for such a critical piece of automation [10:01:08] Nothing should be impacted by a 30-minute downtime [10:01:13] jynus: I'll do that in a bit, currently in the middle of something else [10:01:18] sure :-D [10:04:11] I also don't see any meaningful NEL spikes or anything on other general health metrics [10:04:27] thanks to everybody who quickly jumped in to help, btw [10:04:45] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [10:05:35] jynus: Sorry, I thought I copy-pasted the config, but apparently not [10:06:08] oh, I don't think that needs justification! 
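The CI check proposed earlier in the discussion (every challenge named in `profile::acme_chief::certificates` must exist in `profile::acme_chief::challenges`) could be sketched roughly as below. This is only an illustration under assumptions: it takes the hiera data as an already-parsed Python dict, assumes each certificate entry carries a `challenge` key, and uses a hypothetical function name `find_unknown_challenges` and made-up example hostnames. The check eventually tracked in the ticket may look quite different.

```python
def find_unknown_challenges(hiera):
    """Return sorted (certificate, challenge) pairs whose challenge is not configured.

    `hiera` is assumed to be the parsed hiera data as a plain dict.
    """
    certificates = hiera.get("profile::acme_chief::certificates", {})
    challenges = hiera.get("profile::acme_chief::challenges", {})
    return sorted(
        (name, details.get("challenge"))
        for name, details in certificates.items()
        if details.get("challenge") not in challenges
    )


# Hypothetical example data; names and structure are illustrative only.
EXAMPLE_HIERA = {
    "profile::acme_chief::challenges": {"dns-01": {}, "http-01": {}},
    "profile::acme_chief::certificates": {
        "idm-test": {"CN": "idm-test.example.org", "challenge": "dns01"},
        "mirrors": {"CN": "mirrors.example.org", "challenge": "dns-01"},
    },
}

if __name__ == "__main__":
    for cert, challenge in find_unknown_challenges(EXAMPLE_HIERA):
        print(f"certificate {cert!r} uses unknown challenge {challenge!r}")
```

A CI job could fail the build whenever the returned list is non-empty, which would have flagged the `dns01` typo before merge.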
[10:06:26] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 104 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:06:26] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 104 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:06:27] the automation M. mentioned would help! [10:07:22] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3704 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:10:49] (03PS17) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) [10:13:03] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [10:17:56] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 999584 and 469 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:19:39] (03CR) 10MVernon: "I've fixed the templating errors and suchlike, so I think this is good to go now - are you happy to +1 it again, please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [10:20:56] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:20] PROBLEM - Check no envoy runtime configuration is left persistent on idm-test1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:23:38] ACKNOWLEDGEMENT - Check no envoy runtime configuration is left persistent on idm-test1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused Slyngshede In process of being setup https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:23:38] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service Slyngshede In process of being setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:44] 10SRE, 10Acme-chief: Ci check for acme-chief changes - https://phabricator.wikimedia.org/T326942 (10MoritzMuehlenhoff) [10:23:56] PROBLEM - Check that envoy is running on idm-test1001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:24:33] ACKNOWLEDGEMENT - Check that envoy is running on idm-test1001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed Slyngshede In process of being setup 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:38:01] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [10:44:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] ssh: update match_config data structure [puppet] - 10https://gerrit.wikimedia.org/r/879602 (owner: 10Jbond) [10:44:43] (03CR) 10Jbond: [C: 03+2] ssh::server: add validate_cmd to sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/879605 (owner: 10Jbond) [10:48:34] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:46] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:58] lol, that didn't last long [10:53:53] !log installing bast5003 T324974 [10:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] T324974: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 [10:54:41] godog: some ongoing issues with ubuntu mirror, will try restarting later in the day, but it is out of our control, as I understand it [10:55:07] acking it until later [10:55:56] (I mean with the updated, not with our local one) [10:56:00] *updater [10:56:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:17] (03CR) 10Jbond: [C: 03+1] "lgtm optional nit. would be great if we could add something to the confine but couldn't think of anything simple. 
best i could think of " [puppet] - 10https://gerrit.wikimedia.org/r/879624 (owner: 10JHathaway) [11:09:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879743 (owner: 10Muehlenhoff) [11:10:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/879742 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [11:10:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/879747 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [11:10:42] jynus: ack, thank you for the update [11:18:44] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:34] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36326305056 and 1078 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:19:58] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22473766320 and 1102 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:20:12] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40463327256 and 1115 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:20:36] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26497644528 and 1139 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:20:38] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24751657496 and 1142 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:20:42] effie or other service ops folks maybe? 
^ see maps pg replication alerts [11:21:50] (03CR) 10Jbond: Netbox: add support for central Redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [11:23:28] PROBLEM - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:53] (03PS5) 10Ayounsi: Netbox: add support for central Redis [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) [11:26:09] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [11:26:40] RECOVERY - Check systemd state on idm-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:48] (03CR) 10Ayounsi: Netbox: add support for central Redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [11:26:52] RECOVERY - Check that envoy is running on idm-test1001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:39:37] (03CR) 10MVernon: [C: 03+2] swift: disable swifrepl timer job [puppet] - 10https://gerrit.wikimedia.org/r/879520 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [11:39:47] (03CR) 10MVernon: [C: 03+2] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [11:45:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new bastions - jmm@cumin2002" [11:48:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new bastions - jmm@cumin2002" [11:54:23] (03PS1) 10Slyngshede: 
C:IDM Add the host itself to allowed hosts. [puppet] - 10https://gerrit.wikimedia.org/r/879767 [11:54:48] RECOVERY - Check no envoy runtime configuration is left persistent on idm-test1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:56:38] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:29] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [11:57:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39125/console" [puppet] - 10https://gerrit.wikimedia.org/r/879767 (owner: 10Slyngshede) [11:58:00] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 102, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:58:16] (03PS1) 10MVernon: swift: make storage servers also conftool clients [puppet] - 10https://gerrit.wikimedia.org/r/879769 (https://phabricator.wikimedia.org/T299125) [11:59:29] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879769 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [12:01:11] (03CR) 10Jelto: [C: 03+2] gitlab: start restore job later on replicas [puppet] - 10https://gerrit.wikimedia.org/r/879406 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [12:02:34] (03CR) 10MVernon: "Sorry, I had the wrong conftool thing to get confctl available." 
[puppet] - 10https://gerrit.wikimedia.org/r/879769 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [12:07:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:28] (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.7 [puppet] - 10https://gerrit.wikimedia.org/r/879775 (https://phabricator.wikimedia.org/T326815) [12:22:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:25:18] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 11 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:25:41] ^ fixing this atm [12:27:46] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:27:49] (03PS2) 10Slyngshede: C:idm:deployment change config to work with Envoy [puppet] - 10https://gerrit.wikimedia.org/r/879767 [12:30:50] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently disabled (Running Backup Restore), not alerting. Last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:33:42] jelto: let's schedule some meeting at some point this quarter to try to make up a plan for better gitlab backups :-D [12:35:15] jynus: yeah. That's already planned as an OKR.
However the current issue is about restore of backups [12:37:04] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:46] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:38:47] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2002.wikimedia.org with reason: troubeleshoot backup restore on gitlab replica [12:39:00] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab2002.wikimedia.org with reason: troubeleshoot backup restore on gitlab replica [12:47:46] (JobUnavailable) firing: (13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:48:47] !log installing bast6002 T324974 [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:51] T324974: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 [12:49:34] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879767 (owner: 10Slyngshede) [12:52:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:46] (JobUnavailable) firing: 
(13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879775 (https://phabricator.wikimedia.org/T326815) (owner: 10Jelto) [12:53:19] (03CR) 10Muehlenhoff: [C: 03+2] Also remove duplicates from absent_ldap [puppet] - 10https://gerrit.wikimedia.org/r/879743 (owner: 10Muehlenhoff) [12:53:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:02:46] (JobUnavailable) firing: (13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:39] (03CR) 10Abijeet Patro: [C: 03+1] testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) (owner: 10KartikMistry) [13:04:45] (03PS3) 10Abijeet Patro: testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) (owner: 10KartikMistry) [13:04:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879769 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:05:26] (03CR) 10MVernon: [C: 03+2] swift: make storage servers also conftool clients [puppet] - 10https://gerrit.wikimedia.org/r/879769 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:06:01] godog: cheers, tx [13:06:09] cc hnowlan 
[13:08:49] (03PS1) 10Muehlenhoff: Default to the bullseye installer for VM [puppet] - 10https://gerrit.wikimedia.org/r/879782 [13:09:57] (03CR) 10BBlack: [C: 03+1] varnish: Revert export of Prometheus params [puppet] - 10https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [13:12:30] (03CR) 10Xcollazo: [C: 03+1] "Thank you for this patch! LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/879648 (https://phabricator.wikimedia.org/T326827) (owner: 10Ottomata) [13:14:46] (03PS1) 10MVernon: swift: fix typo in rclone.conf template [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) [13:17:01] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:17:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:08] (03PS2) 10MVernon: swift: fix typo in rclone.conf template [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) [13:19:22] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:21:51] (03CR) 10MVernon: "Sorry!" 
[puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:22:48] effie: looks like indices are being rebuilt, might be just a matter of waiting [13:22:55] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: fix typo in rclone.conf template [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:24:50] (03CR) 10MVernon: [C: 03+2] swift: fix typo in rclone.conf template [puppet] - 10https://gerrit.wikimedia.org/r/879783 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [13:25:24] hnowlan: awesome, thank you!!! [13:28:44] (03CR) 10Jelto: [C: 03+1] "lgtm. I'd like to have some feedback from RelEng too" [puppet] - 10https://gerrit.wikimedia.org/r/879693 (https://phabricator.wikimedia.org/T326515) (owner: 10Legoktm) [13:39:24] (03PS1) 10Jbond: idp: add oidc_issuers_pattern via the profile [puppet] - 10https://gerrit.wikimedia.org/r/879807 (https://phabricator.wikimedia.org/T311999) [13:40:02] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1804248 and 7083 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:40:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39126/console" [puppet] - 10https://gerrit.wikimedia.org/r/879807 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [13:47:18] (03Abandoned) 10Jbond: idp: add oidc_issuers_pattern via the profile [puppet] - 10https://gerrit.wikimedia.org/r/879807 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [13:52:08] (03PS2) 10Volans: Upstream release v4.2.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 [13:54:56] (03PS1) 10Jbond: idp: add idm-test services [puppet] - 10https://gerrit.wikimedia.org/r/879809 (https://phabricator.wikimedia.org/T311999) [13:55:46] (03CR) 10Jbond: [V: 03+1] "PCC 
SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39127/console" [puppet] - 10https://gerrit.wikimedia.org/r/879809 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [13:56:49] (03CR) 10Volans: "Tested on build2001 for sid, it works also with HOME=/nonexistent and uscan seems happy too." [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 (owner: 10Volans) [13:57:12] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp: add idm-test services [puppet] - 10https://gerrit.wikimedia.org/r/879809 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [14:00:49] (03CR) 10Ottomata: flink-kubernetes-operator - allow flink-app pods to talk to k8s API (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:05:36] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment change config to work with Envoy [puppet] - 10https://gerrit.wikimedia.org/r/879767 (owner: 10Slyngshede) [14:09:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/879782 (owner: 10Muehlenhoff) [14:12:46] (03CR) 10JMeybohm: [C: 03+1] flink-kubernetes-operator - allow flink-app pods to talk to k8s API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:13:21] (03PS2) 10Ottomata: flink-kubernetes-operator - allow flink-app pods to talk to k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) [14:14:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:16:22] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:19:32] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" 
[software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 (owner: 10Volans) [14:21:04] (03CR) 10Muehlenhoff: [C: 03+2] Default to the bullseye installer for VM [puppet] - 10https://gerrit.wikimedia.org/r/879782 (owner: 10Muehlenhoff) [14:25:46] (03PS1) 10Papaul: Add new an-coord and an-mariad to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879812 (https://phabricator.wikimedia.org/T321119) [14:26:47] (03PS2) 10Papaul: Add new an-coord and an-mariadb to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879812 (https://phabricator.wikimedia.org/T321119) [14:27:00] (03PS3) 10Papaul: Add new an-coord and an-mariadb to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879812 (https://phabricator.wikimedia.org/T321119) [14:29:14] (03CR) 10Papaul: [C: 03+2] Add new an-coord and an-mariadb to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879812 (https://phabricator.wikimedia.org/T321119) (owner: 10Papaul) [14:32:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Papaul) [14:33:46] (03CR) 10Volans: [C: 03+2] Upstream release v4.2.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 (owner: 10Volans) [14:34:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host an-coord1003.eqiad.wmnet with OS bullseye [14:34:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-coord100... 
[14:38:16] (03PS1) 10Filippo Giunchedi: webperf: use rsync::quickdatacopy for arclamp data [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) [14:38:18] (03PS1) 10Filippo Giunchedi: arclamp: move to EnvironmentFile for generate/compress jobs [puppet] - 10https://gerrit.wikimedia.org/r/879814 (https://phabricator.wikimedia.org/T319434) [14:38:29] (03CR) 10JMeybohm: [C: 03+1] flink-kubernetes-operator - allow flink-app pods to talk to k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:40:01] (03Merged) 10jenkins-bot: Upstream release v4.2.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 (owner: 10Volans) [14:40:17] (03CR) 10Muehlenhoff: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [14:41:43] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10herron) a:05herron→03None [14:42:02] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10herron) p:05High→03Low [14:43:18] (03CR) 10Filippo Giunchedi: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [14:43:53] (03CR) 10Filippo Giunchedi: [C: 03+2] Clean up monitor metrics on stop() [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/844469 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [14:46:10] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 876560 and 3253 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:46:56] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1478288 and 3298 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:48:14] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10hashar) Not sure about the comparison but on deploy1002 we have roughly: `lines=10 $ (cd /srv/deployment && find . -mindepth... [14:49:41] !log uploaded cumin_4.2.0 to apt.wikimedia.org bullseye-wikimedia [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:54] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 69 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:52:52] (03CR) 10Muehlenhoff: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [14:54:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1003.eqiad.wmnet with reason: host reimage [14:57:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1003.eqiad.wmnet with reason: host reimage [15:00:27] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 166 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:01:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host an-coord1004.eqiad.wmnet with OS bullseye [15:01:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-coord100... [15:05:51] (03CR) 10Filippo Giunchedi: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [15:07:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:11:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:11:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:12:49] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 77, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1004.eqiad.wmnet with reason: host reimage [15:16:41] (03CR) 10Muehlenhoff: webperf: use rsync::quickdatacopy for arclamp data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: 10Filippo Giunchedi) [15:17:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1004.eqiad.wmnet with reason: host reimage [15:18:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:18:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-coord1003.eqiad.wmnet with OS bullseye [15:18:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: 
Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-coord1003.eq... [15:19:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new bastion - jmm@cumin2002" [15:19:59] (03CR) 10JMeybohm: "Should we add a template for this annotation right away to the base module?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [15:20:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host an-mariadb1001.eqiad.wmnet with OS bullseye [15:20:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-mariadb1... 
[15:20:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new bastion - jmm@cumin2002" [15:22:52] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:50] !log restarted again update-ubuntu-mirror on mirror1001 due to remote server concurrency issues [15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:34:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:34:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-coord1004.eqiad.wmnet with OS bullseye [15:34:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-coord1004.eq... [15:37:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Papaul) [15:38:13] (03CR) 10JHathaway: "thanks for reviewing @jond!" 
[puppet] - 10https://gerrit.wikimedia.org/r/879624 (owner: 10JHathaway) [15:38:17] (03CR) 10JHathaway: [C: 03+2] facter block_devices support containers [puppet] - 10https://gerrit.wikimedia.org/r/879624 (owner: 10JHathaway) [15:39:52] (03PS1) 10Vivian Rook: jupyter referer [puppet] - 10https://gerrit.wikimedia.org/r/879824 (https://phabricator.wikimedia.org/T326217) [15:42:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] jupyter referer [puppet] - 10https://gerrit.wikimedia.org/r/879824 (https://phabricator.wikimedia.org/T326217) (owner: 10Vivian Rook) [15:42:31] (03CR) 10Vivian Rook: [C: 03+2] jupyter referer [puppet] - 10https://gerrit.wikimedia.org/r/879824 (https://phabricator.wikimedia.org/T326217) (owner: 10Vivian Rook) [15:43:26] (03PS1) 10Papaul: fix typo for an-mariadb node in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879825 (https://phabricator.wikimedia.org/T321119) [15:44:19] (03CR) 10Papaul: [C: 03+2] fix typo for an-mariadb node in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/879825 (https://phabricator.wikimedia.org/T321119) (owner: 10Papaul) [15:45:03] 10SRE, 10serviceops-collab: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (10LSobanski) [15:45:10] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 14607817728 and 1036 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:45:43] 10SRE, 10SRE-swift-storage: Number of mw swift objects in eqiad greater than codfw - https://phabricator.wikimedia.org/T326857 (10LSobanski) [15:45:46] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3953103024 and 1074 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:46:08] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3447562144 and 1096 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:56:21] (03CR) 10Hashar: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [15:57:58] (03CR) 10Ahmon Dancy: [C: 03+1] "Looks reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/879693 (https://phabricator.wikimedia.org/T326515) (owner: 10Legoktm) [15:59:32] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7148000024 and 542 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:04:08] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 302 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:06:10] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:07:42] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:08:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - https://gerrit.wikimedia.org/r/879814 (https://phabricator.wikimedia.org/T319434) (owner: Filippo Giunchedi) [16:09:36] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29717379712 and 1806 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:13:31] (PS1) MVernon: Alerts: stop alerting on thumb number mismatch [puppet] - https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T313102) [16:14:12] SRE-swift-storage, ops-codfw, DC-Ops: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (RobH) [16:14:25] (PS2) MVernon: Alerts: stop alerting on thumb number mismatch [puppet] - https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T326857) [16:14:40] SRE-swift-storage, ops-codfw, DC-Ops: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (RobH) [16:15:09] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T326857) (owner: MVernon) [16:16:47] (PS2) Jdlrobson: English Wikipedia uses Vector 2022 skin [mediawiki-config] - https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) [16:18:33] (PS3) Jdlrobson: English Wikipedia uses Vector 2022 skin [mediawiki-config] - https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) [16:18:40] (CR) BCornwall: [V: +1 C: +2] varnish: Revert export of Prometheus params [puppet] - https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [16:19:05] ops-codfw, DC-Ops, Data-Persistence-Backup, bacula: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (jcrespo) [16:19:37] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/879827 
(https://phabricator.wikimedia.org/T326857) (owner: MVernon) [16:19:51] (PS3) JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) [16:20:00] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (jcrespo) [16:20:22] ops-codfw, DC-Ops, Data-Persistence, Data-Persistence-Backup, bacula: Q4:rack/setup/install backup2010, backup2011 - https://phabricator.wikimedia.org/T326965 (jcrespo) [16:22:06] (CR) Jdlrobson: [C: -1] "todo: page tools" [mediawiki-config] - https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: Jdlrobson) [16:22:32] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 409 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:27:09] (CR) Eevans: [C: +1] Alerts: stop alerting on thumb number mismatch [puppet] - https://gerrit.wikimedia.org/r/879827 (https://phabricator.wikimedia.org/T326857) (owner: MVernon) [16:28:10] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01053 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:31:42] (PS1) BCornwall: prometheus: Properly set old script as absent [puppet] - https://gerrit.wikimedia.org/r/879829 (https://phabricator.wikimedia.org/T323723) [16:34:35] (PS2) BCornwall: prometheus: Properly set old script as absent [puppet] - https://gerrit.wikimedia.org/r/879829 (https://phabricator.wikimedia.org/T323723) [16:35:34] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39129/console" [puppet] - https://gerrit.wikimedia.org/r/879829 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) 
[16:36:06] (CR) BCornwall: [V: +1 C: +2] prometheus: Properly set old script as absent [puppet] - https://gerrit.wikimedia.org/r/879829 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [16:39:46] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 434 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:42:16] (PS1) Abijeet Patro: TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format [extensions/TranslationNotifications] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879793 (https://phabricator.wikimedia.org/T63125) [16:43:30] PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:36] PROBLEM - Check systemd state on cp3062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:10] PROBLEM - Check systemd state on cp5022 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:30] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 780 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:46:20] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 830 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:51:26] RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:10] PROBLEM - Check systemd state on cp6009 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:12] PROBLEM - Check systemd state on cp4043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:16] PROBLEM - Check systemd state on cp4040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:18] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_varnish_params.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:05] (PS7) Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [17:07:18] jouncebot: now [17:07:19] For the next 14 hour(s) and 52 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230113T0800) [17:07:54] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003008 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:09:41] (PS2) Jdlrobson: Remove redundant block for search descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) [17:13:26] thcipriani, ready to emergency deploy for 879820: TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TranslationNotifications/+/879820 -- context is https://phabricator.wikimedia.org/T63125#8523935 [17:13:40] abijeet: sure [17:14:33] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/TranslationNotifications] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879793 (https://phabricator.wikimedia.org/T63125) (owner: Abijeet Patro) [17:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:20:37] (Merged) jenkins-bot: TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format [extensions/TranslationNotifications] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879793 (https://phabricator.wikimedia.org/T63125) (owner: Abijeet Patro) [17:20:54] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879793|TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format (T63125)]] [17:20:58] T63125: Ability to notify all languages except some - https://phabricator.wikimedia.org/T63125 [17:22:24] RECOVERY - Check systemd state on cp4040 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:28] RECOVERY - Check systemd state on cp5022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:34] !log thcipriani@deploy1002 thcipriani and abi: Backport for [[gerrit:879793|TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format (T63125)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [17:22:54] RECOVERY - Check systemd state on cp6009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:55] ^ abijeet your backport should be on mwdebug servers, can you verify for me? [17:22:56] RECOVERY - Check systemd state on cp4043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:06] thcipriani, sure. checking. [17:23:26] RECOVERY - Check systemd state on cp3062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:26] RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:32] thcipriani, the fix is in a job class, that would not be updated until we do a full deploy right? [17:25:43] oh, as in it runs on the jobrunners async? yes. that's accurate. mwdebug is appservers. [17:27:10] thcipriani, i verified on debug servers that nothing further is broken. :-) I see the same error in the job queue. 
[17:27:34] ok, I'll go ahead and once it's live on jobrunners I'll let you know [17:34:19] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879793|TranslationNotificationsSubmitJob: Ensure LanguageSet is in proper format (T63125)]] (duration: 13m 25s) [17:34:23] T63125: Ability to notify all languages except some - https://phabricator.wikimedia.org/T63125 [17:34:29] ^ abijeet alright, should be live everywhere now [17:35:19] checking [17:37:12] thcipriani, the fix appears to be working. Thank you! [17:37:34] \o/ thanks for checking :) [17:43:41] (CR) Andrew Bogott: [C: +1] hieradata: add wmcs-roots to clouddumps servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879274 (owner: Majavah) [17:58:41] (CR) Dzahn: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 15.7 [puppet] - https://gerrit.wikimedia.org/r/879775 (https://phabricator.wikimedia.org/T326815) (owner: Jelto) [18:06:30] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:20:46] (PS1) Ssingh: Release 1.15.10 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/879839 (https://phabricator.wikimedia.org/T321191) [18:25:08] !log mwscript extensions/GlobalBlocking/maintenance/FixBlockerUsername.php --wiki metawiki "Green Giant" "Cromium" # T298707 [18:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:12] T298707: "InvalidArgumentException: Blocker must be a local user" from GlobalBlocking - https://phabricator.wikimedia.org/T298707 [18:29:03] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:32:07] (PS7) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 
(https://phabricator.wikimedia.org/T323723) [18:33:37] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39130/console" [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [18:37:35] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:37:47] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:37:55] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758 (fnegri) [18:38:03] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri) [18:38:56] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:39:40] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri) [18:39:48] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) [18:41:49] (CR) Herron: [C: +1] wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [18:47:52] (PS3) BCornwall: prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) 
[18:50:04] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri) As Q2 is now over, I suggest that we consider this first iteration complete as soon as we can successfully run WMCS cookbooks from the ne... [18:53:57] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39131/console" [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:03:59] (PS8) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) [19:06:36] (PS1) Papaul: Fix an-mariadb100[1-2] in netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/879847 (https://phabricator.wikimedia.org/T321119) [19:07:19] (PS9) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) [19:07:48] (CR) Papaul: [C: +2] Fix an-mariadb100[1-2] in netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/879847 (https://phabricator.wikimedia.org/T321119) (owner: Papaul) [19:09:45] (PS2) BCornwall: varnish: Alert on high thread count [alerts] - https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723) [19:09:47] (PS1) BCornwall: node: Exclude varnish params file from stale check [alerts] - https://gerrit.wikimedia.org/r/879848 (https://phabricator.wikimedia.org/T323723) [19:12:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:13:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:41] (PS4) BCornwall: prometheus: Add 
Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) [19:14:43] (PS1) BCornwall: fixup! prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/879849 [19:15:20] (Abandoned) BCornwall: fixup! prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/879849 (owner: BCornwall) [19:15:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.927 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:57] (PS5) BCornwall: prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) [19:17:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:22] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39132/console" [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:22:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host an-mariadb1002.eqiad.wmnet with OS bullseye [19:22:34] SRE, ops-eqiad, DC-Ops, 
Data-Engineering-Planning, and 2 others: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-mariadb1002.eqiad.wmnet wi... [19:22:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-mariadb1001.eqiad.wmnet with reason: host reimage [19:23:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-mariadb1001.eqiad.wmnet with reason: host reimage [19:28:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:28:34] (CR) BBlack: [C: +1] "Looks solid!" 
[puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:29:00] (CR) BBlack: [C: +1] prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:31:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:39] (CR) BCornwall: [V: +1 C: +2] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:34:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage [19:36:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage [19:38:24] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:39:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 116 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:40:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:41:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-mariadb1001.eqiad.wmnet with OS bullseye [19:41:05] (CR) BCornwall: "Sorry for needing the re-review but the metrics name had to change! There's only the one varnish_param_threads_max that comes pre-calculat" [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:41:07] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-mariadb1001.... 
[19:42:06] (PS1) Zabe: eventlogging: drop absented check_eventlogging_jobs file [puppet] - https://gerrit.wikimedia.org/r/879854 [19:42:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:37] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39133/console" [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:49:14] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host aphlict2001.codfw.wmnet [19:49:15] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:51:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:52:04] (CR) BCornwall: [V: +1 C: +2] prometheus: Add Varnish thread percent usage rule [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [19:52:15] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict2001.codfw.wmnet - dzahn@cumin2002" [19:52:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:10] (PS2) BCornwall: node: Exclude varnish params file from stale check [alerts] - https://gerrit.wikimedia.org/r/879848 (https://phabricator.wikimedia.org/T323723) [19:54:18] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict2001.codfw.wmnet - dzahn@cumin2002" [19:54:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:54:18] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache aphlict2001.codfw.wmnet on all recursors [19:54:20] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict2001.codfw.wmnet on all recursors [19:57:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:58:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:58:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-mariadb1002.eqiad.wmnet with OS bullseye [19:58:16] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-mariadb1002.... 
[19:59:34] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Papaul) [20:04:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aphlict2001.codfw.wmnet [20:12:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:13:28] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Papaul) Open→Resolved @BTullis this is done. [20:16:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1009'] [20:16:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1010'] [20:17:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:28] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (Aklapper) [20:17:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:22:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:24:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.229 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:24:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:03] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['druid1009'] [20:35:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1009'] [20:35:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['druid1009'] [20:36:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['druid1010'] [20:36:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1010'] [20:37:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['druid1010'] [20:37:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1011'] [20:39:29] (PS1) Daniel Kinzler: Use sendemail limit instead of emailuser [mediawiki-config] - https://gerrit.wikimedia.org/r/879864 [20:39:34] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:07] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Papaul) @BTullis can you please specify the exact partman recipe to use? 
Thanks [20:44:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:34] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:48:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['druid1011'] [20:49:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1011'] [20:52:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:28] (PS2) Daniel Kinzler: Use sendemail limit instead of emailuser [mediawiki-config] - https://gerrit.wikimedia.org/r/879864 [20:55:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['druid1011'] [20:57:48] (ProbeDown) firing:
(2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['druid1011'] [20:58:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts ['druid1011'] [21:00:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:33] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Papaul) [21:03:33] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:07:48] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:08:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:34] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:21:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:22:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:32] (CR) BBlack: [C: +2] node: Exclude varnish params file from stale check [alerts] - https://gerrit.wikimedia.org/r/879848 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [21:27:07] (CR) BBlack: [C: +1] node: Exclude varnish params file from stale check [alerts] - https://gerrit.wikimedia.org/r/879848 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [21:28:32] (CR) BBlack: [C: +1] varnish: Alert on high thread count [alerts] - https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [21:29:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:39] (CR) BCornwall: [C: +2] node: Exclude varnish params file from stale check [alerts] - https://gerrit.wikimedia.org/r/879848 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [21:37:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:49:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:59:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:01:06] (CR) Krinkle: [C: +1] webperf: use rsync::quickdatacopy for arclamp data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: Filippo Giunchedi) [22:01:45] (CR) Krinkle: [C: +1] arclamp: move to EnvironmentFile for generate/compress jobs [puppet] - https://gerrit.wikimedia.org/r/879814 (https://phabricator.wikimedia.org/T319434) (owner: Filippo Giunchedi) [22:02:43] SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Dzahn) >>! In T326590#8512008, @RhinosF1 wrote: > @mutante: adding as IC, can you please let people know when the incident report from la... [22:04:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:05:57] (CR) Dzahn: webperf: use rsync::quickdatacopy for arclamp data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879813 (https://phabricator.wikimedia.org/T319434) (owner: Filippo Giunchedi) [22:05:59] (PS3) BCornwall: varnish: Alert on high thread count [alerts] - https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723) [22:08:45] (CR) BCornwall: [C: +2] varnish: Alert on high thread count [alerts] - https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [22:09:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:11:31] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:13:01] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:14:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:16:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:19:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:19:58] re: cr2-esams - it's Lumen. 
checking for maint-announce [22:20:26] yes, it is scheduled maintenance work - ACKing [22:20:48] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Lumen Scheduled Maintenance Window #: 25649651-1, Work Started https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:23:37] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:44] !log mirror1001 - systemctl start update-ubuntu-mirror (sometimes sync fails) [22:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:47] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:43] (PS1) Cwhite: apifeatureusage: use new kafka truststore [puppet] - https://gerrit.wikimedia.org/r/879886 (https://phabricator.wikimedia.org/T300130) [22:31:58] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn tech dispatch to site with an ETA of 3:00 pm ET.
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:33] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 225, down: 2, dormant: 0, excluded: 0, unused: 0: daniel_zahn tech dispatch to site with an ETA of 3:00 pm ET https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:35:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:42:05] (PS2) Cwhite: profile: clean up legacy apifeatureusage class [puppet] - https://gerrit.wikimedia.org/r/879886 [22:44:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:10] SRE, Data-Engineering, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (Dzahn) In an ideal world, once we know a new expiration date, we could add it to the "maintenance calendar", like 2 weeks before it expires. And then the clinic...
[22:54:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:53] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:59:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:04:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:09:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:18:02] (PS1) Cwhite: role, profile: remove logstash(7) role and hiera config [puppet] - https://gerrit.wikimedia.org/r/879887 [23:18:04] (PS1) Cwhite: role: remove kibana7_ecs role [puppet] - https://gerrit.wikimedia.org/r/879888 [23:18:06] (PS1) Cwhite: role, profile: remove elasticsearch role and supporting profile [puppet] -
https://gerrit.wikimedia.org/r/879889 [23:22:59] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi) [23:24:02] (CR) Cwhite: [C: +1] "LGTM, should be safe to deploy when you are ready." [puppet] - https://gerrit.wikimedia.org/r/879417 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert) [23:24:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:28:51] (PS1) Dzahn: peopleweb: only have rsync auto_restart service on active server [puppet] - https://gerrit.wikimedia.org/r/879876 (https://phabricator.wikimedia.org/T326888) [23:29:11] (CR) CI reject: [V: -1] peopleweb: only have rsync auto_restart service on active server [puppet] - https://gerrit.wikimedia.org/r/879876 (https://phabricator.wikimedia.org/T326888) (owner: Dzahn) [23:30:58] (PS2) Dzahn: peopleweb: only have rsync auto_restart service on active server [puppet] - https://gerrit.wikimedia.org/r/879876 (https://phabricator.wikimedia.org/T326888) [23:32:24] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/879876/39136/people2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/879876 (https://phabricator.wikimedia.org/T326888) (owner: Dzahn) [23:33:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:50] SRE, serviceops-collab, Patch-For-Review: rsync server on people2002 -
https://phabricator.wikimedia.org/T326888 (Dzahn) Actually the proper fix is to not put the "auto_restart_rsync" service on the passive host as it's expected from the code that rsyncd is not installed on the destination of rsync::... [23:37:42] SRE, serviceops-collab, Patch-For-Review: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (Dzahn) Though.. the code in rsync::quickdatacopy is: ` if $source_host == $::fqdn { include rsync::server ` and there is no else-branch. So this means when we switch... [23:38:03] (CR) Dzahn: [V: +1 C: +2] "noop on people1003, removed timer/service for auto_restarts on people2002" [puppet] - https://gerrit.wikimedia.org/r/879876 (https://phabricator.wikimedia.org/T326888) (owner: Dzahn) [23:38:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:39:20] !log people2002 - systemctl reset-failed after removing auto_restart_rsync timers [23:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:23] RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:52] SRE, serviceops-collab, Patch-For-Review: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (Dzahn) 23:40 <+icinga-wm> RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_st...
[23:46:04] (PS1) Dzahn: peopleweb: ensure rsync service is stopped on passive host [puppet] - https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) [23:47:35] (PS2) Dzahn: peopleweb: ensure rsync service is stopped on passive host [puppet] - https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) [23:47:54] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/879878/39137/" [puppet] - https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) (owner: Dzahn) [23:48:15] SRE, serviceops-collab, Patch-For-Review: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (Dzahn) p:Triage→Low [23:49:41] SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) Open→Resolved [23:51:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:56:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:57:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown