[00:33:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [00:42:08] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [00:42:56] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:38] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.83 ms [01:30:22] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:39:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:56] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:43:25] (03CR) 10Andrew Bogott: [C: 03+1] openstack.galera: add nodecheck logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/809100 (owner: 10David Caro) [01:46:24] (03CR) 10Andrew Bogott: [C: 03+1] openstack: update control nodes after refresh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822735 (owner: 10David Caro) [01:49:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32382 and previous config saved to /var/cache/conftool/dbconfig/20220815-015020-ladsgroup.json [01:50:25] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 
[01:52:27] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.neutron: Add alert to open a task when agent down [alerts] - 10https://gerrit.wikimedia.org/r/822319 (owner: 10David Caro) [01:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [01:57:14] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:01:08] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [02:02:40] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:04:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32383 and previous config saved to /var/cache/conftool/dbconfig/20220815-020526-ladsgroup.json [02:07:02] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:09:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32384 and previous config saved to /var/cache/conftool/dbconfig/20220815-022032-ladsgroup.json [02:29:34] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32385 and previous config saved to /var/cache/conftool/dbconfig/20220815-023538-ladsgroup.json [02:35:43] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [02:46:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:06] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:28] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:16] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:46] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:31:06] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:06] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:22] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [06:12:06] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:38] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:32] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: 10MdsShakil) [06:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:22:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:23:08] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:43:32] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:45:10] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:48:30] 
jouncebot: next [06:48:30] In 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T0700) [07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T0700). Please do the needful. [07:00:05] phuedx, Urbanecm, and MdsShakil: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:23] I can deploy today! [07:00:28] thanks! [07:01:05] * urbanecm starts with his own patches [07:01:08] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822701 (https://phabricator.wikimedia.org/T315141) (owner: 10Urbanecm) [07:01:15] (03CR) 10Urbanecm: [C: 03+2] throttle: Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822702 (owner: 10Urbanecm) [07:01:18] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822725 (https://phabricator.wikimedia.org/T315182) (owner: 10Urbanecm) [07:02:01] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822701 (https://phabricator.wikimedia.org/T315141) (owner: 10Urbanecm) [07:02:09] (03Merged) 10jenkins-bot: throttle: Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822702 (owner: 10Urbanecm) [07:02:12] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822725 (https://phabricator.wikimedia.org/T315182) (owner: 10Urbanecm) [07:03:48] phuedx: MdsShakil: are you around, please? 
:) [07:03:56] Yah [07:04:20] urbanecm: [07:04:37] (03PS13) 10Urbanecm: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: 10MdsShakil) [07:04:42] (03CR) 10Urbanecm: [C: 03+2] Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: 10MdsShakil) [07:05:03] hello MdsShakil, thanks. let's do your patch next. Are you familiar with how testing via x-wikimedia-debug works (it's fine if not, i can explain that)? [07:05:33] (03Merged) 10jenkins-bot: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: 10MdsShakil) [07:06:38] urbanecm: Previously I worked with mwdebug1001.eqiad, is that the same? [07:06:43] yup [07:06:53] mwdebug1001 is one of the debug servers [07:06:55] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 7c2a393ee: dc0d62a3: 6f687bcfc: Update throttle rules (T315182, T315141) (duration: 03m 21s) [07:06:59] I'll let you know once your patch is there [07:07:03] T315182: Request a throttle lift for Wiki-Editathon – 2022-08-23 - https://phabricator.wikimedia.org/T315182 [07:07:03] T315141: Request a throttle lift for Festival of media education – 2022-08-16 - 2022-08-18 - https://phabricator.wikimedia.org/T315141 [07:07:19] MdsShakil: your patch is at mwdebug1001. can you test it there? 
[07:08:46] !log mwscript resetAuthenticationThrottle.php --wiki=cswiki --signup --ip='194.31.191.20' # T315141 [07:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:09:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:09:29] urbanecm: I think all ok [07:09:29] (03PS5) 10Urbanecm: Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [07:09:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:09:34] MdsShakil: okay, thanks! [07:09:35] syncing [07:09:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:09:38] (03CR) 10Urbanecm: [C: 03+2] Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [07:09:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:09:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32386 and previous config saved to /var/cache/conftool/dbconfig/20220815-070955-ladsgroup.json [07:09:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:00] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:10:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:10:31] (03Merged) 
10jenkins-bot: Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [07:11:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:05] phuedx: B&C window is happening, are you around? :) [07:13:20] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 43cd5ef1bc38bdc8f46f3093cf0baa74cccc9678: Add bnwiki in wgImportSources to bnwikibooks (T314820) (duration: 03m 05s) [07:13:24] T314820: Add bnwiki in wgImportSources to bnwikibooks - https://phabricator.wikimedia.org/T314820 [07:13:25] MdsShakil: your patch should be live! [07:13:29] anything else? :) [07:16:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:16:30] urbanecm: Thank you. That's working fine on onwiki also. [07:16:30] https://w.wiki/5aFK [07:16:40] great :) [07:17:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:17:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:17:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a454d3bc56c344fa62625f7c292ea087bddfebe5: Pin wgCheckUserLogReasonMigrationStage to read and write old (T233004) (duration: 03m 16s) [07:17:10] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [07:17:34] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:17:45] !log UTC morning B&C window done [07:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:23:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:58] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 1.14 
ms [07:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:27:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:27:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:31:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:33:30] (03PS1) 10Kevin Bazira: ml-services: Add euwiki, huwiki & hywiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/823109 (https://phabricator.wikimedia.org/T314456) [07:39:26] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:52:34] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.79 ms [07:52:52] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:02] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:56] (03PS1) 10Tim Starling: Switch www.mediawiki.org to multi-DC mode [puppet] - 10https://gerrit.wikimedia.org/r/823113 [08:09:09] (03PS2) 10Jbond: Add Cumin aliases for ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/820129 (owner: 10Muehlenhoff) [08:10:40] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:11:29] (03CR) 10Jbond: [C: 03+1] "LGTM but will leave someone from wmcs to merge" [puppet] - 10https://gerrit.wikimedia.org/r/821759 (owner: 10Majavah) [08:12:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [08:17:04] 
RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [08:25:44] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) Cool, thanks so much for all these explanation!! >>! In T308799#8115381, @BBlack wrote: > I'm hoping that least some banner outputs will be categorically cac... [08:28:03] (03CR) 10Jbond: Add names to flow collectors (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [08:32:38] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:39:48] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:34] (03PS3) 10Ladsgroup: Allow DB config reload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: 10Daniel Kinzler) [08:45:38] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 712.00 ms [08:46:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10EChetty) [08:50:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10EChetty) [08:50:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10EChetty) [08:50:56] (03PS1) 10Jelto: install_server: change partman 
config for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/823115 (https://phabricator.wikimedia.org/T274463) [09:01:10] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:02:18] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [09:13:30] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:17:26] (03CR) 10Jbond: "took a quick pass, lgtm but see inline for comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [09:17:48] (03CR) 10David Caro: [C: 03+2] wmcs.neutron: Add alert to open a task when agent down [alerts] - 10https://gerrit.wikimedia.org/r/822319 (owner: 10David Caro) [09:18:00] (03CR) 10David Caro: icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro) [09:18:03] (03CR) 10David Caro: [C: 03+2] icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro) [09:20:31] (03Merged) 10jenkins-bot: icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro) [09:20:33] (03Merged) 10jenkins-bot: wmcs.neutron: Add alert to open a task when agent down [alerts] - 10https://gerrit.wikimedia.org/r/822319 (owner: 10David Caro) [09:21:10] (03CR) 10Jbond: [C: 03+1] admin: Move soworu-01 from ldap-only to analytics [puppet] - 10https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: 10BCornwall) [09:27:44] (03CR) 10Clément Goubert: 
[C: 03+1] "LGTM code-wise. Small stylistic nits on trailing commas for consistency's sake, up to you." [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [09:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [10:00:06] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [10:01:43] 10SRE-swift-storage, 10ops-codfw: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10MatthewVernon) [10:03:45] !log pd 1I:1:1 modify disablepd forced on ms-be2028 T315213 [10:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:49] T315213: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 [10:18:05] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) @Papaul could you take another look at this, please, and see if we can get the replacement disk to be visible to the RAID controller? Also, I don't know if it's possible that `/d... [10:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:21:35] urbanecm: Doh. Sorry I missed the ping. 
I've rescheduled the patch for later today [10:22:14] phuedx: no worries, happens from time to time :) [10:26:06] (03PS3) 10Hnowlan: Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) [10:26:20] (03CR) 10Hnowlan: Create basic haproxy container (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:27:14] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:28:39] (03CR) 10MVernon: [C: 03+2] swift: move swift ring manager repo [puppet] - 10https://gerrit.wikimedia.org/r/822659 (owner: 10MVernon) [10:34:28] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:34:30] ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315216 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:34:35] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T315216 (10ops-monitoring-bot) [10:40:12] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [10:44:19] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T315216 (10MatthewVernon) [10:44:36] 10SRE, 10SRE-swift-storage, 10ops-codfw: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10MatthewVernon) 
[10:50:13] (03PS1) 10Clément Goubert: scripts/run_ci_locally.sh: Fix arm Mac docker platform warning [puppet] - 10https://gerrit.wikimedia.org/r/823122 [10:58:13] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:06:03] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10dcaro) [11:13:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10dcaro) [11:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:29:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10Peachey88) [11:30:53] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [11:35:17] (03PS1) 10Stang: testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) [11:59:45] (03PS1) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [12:01:52] (03CR) 10CI reject: [V: 04-1] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:03:03] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:45] (03PS2) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [12:12:27] (03PS1) 10Stang: jawiki: Restrict abusefilter log view to "abusefilter-modify" user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) [12:21:01] (03CR) 10Btullis: [C: 03+2] Add roles and cumin aliases for the new dse_k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [12:21:15] (03PS5) 10Majavah: P:toolforge: cleanup bastion grid integration [puppet] - 10https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) [12:23:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [12:45:31] (03CR) 10Ottomata: [C: 03+1] "Nice yeah, this was used for server side EventLogging extension to send events. Pretty sure we've migrated all server side usages. Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: 10Krinkle) [12:46:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Ottomata) Approved! Thank you! [12:46:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10Ottomata) Approved! [12:49:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Ottomata) Adding Event Platform tag, we decided to get this hardware to hopefully better support multi DC event stream processing. 
[12:53:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T1300). [13:00:05] phuedx and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:19] o/ [13:00:44] I can deploy today [13:00:59] (03PS3) 10Urbanecm: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:01:04] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:02:02] (03Merged) 10jenkins-bot: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:03:15] phuedx: your patch is at mwdebug1001, can you check please? [13:03:18] On it [13:04:12] (03PS2) 10Urbanecm: testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang) [13:04:17] (03CR) 10Urbanecm: [C: 03+2] testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang) [13:04:31] urbanecm: I've confirmed that the fully-qualified schema name is still being sent to wikidatawiki but not to, say, enwiki [13:04:32] LGTM [13:04:49] okay, syncing! 
[13:05:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:05:17] (03Merged) 10jenkins-bot: testwikidatawiki: Add wikidata as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823141 (https://phabricator.wikimedia.org/T315211) (owner: 10Stang) [13:06:29] (03PS6) 10Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [13:07:47] (03CR) 10CI reject: [V: 04-1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [13:08:11] PROBLEM - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:08:13] ACKNOWLEDGEMENT - MegaRAID on db2110 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315229 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:08:17] 10SRE, 10ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10ops-monitoring-bot) [13:08:19] (03PS7) 10Jelto: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [13:08:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e2772238003b797b1a8b18b4df0aa56f54132727: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" (T290303) (duration: 03m 29s) [13:08:39] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 [13:08:42] phuedx: should be live! [13:08:56] <3 Thanks. I'll check again to be sure [13:08:57] koi: your first patch is at mwdebug1001, can you check please? 
[13:09:01] looking [13:09:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:48] Everything looks as expected :) [13:09:54] glad to hear that! [13:10:02] urbanecm: works as expected, LGTM [13:10:07] thanks, syncing [13:12:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:12:47] (03CR) 10Jelto: "Thanks for the review! I added a new patchset and commented in-line" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [13:13:04] (03CR) 10Urbanecm: [C: 04-2] "Hello," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: 10Stang) [13:13:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: de81bcb5874aee16b23ffea5a43466572250a6c2: testwikidatawiki: Add wikidata as import source (T315211) (duration: 03m 26s) [13:13:53] T315211: Enable transwiki import from Wikidata to Testwikidata - https://phabricator.wikimedia.org/T315211 [13:14:31] koi: I -2'ed the other change, because I strongly doubt it will have the benefit the ja.wikipedia community expects. I'll explain in more details on the task itself, the -2 is just there to ensure it's not merged before ja.wiki reaches an informed decision. 
[13:14:50] got it, thanks for the explanation [13:14:56] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10MatthewVernon) [13:15:02] to summarize, the issue is that the same information is (and will continue to be) visible via quarry.wmcloud.org and similar [13:15:46] the other patch is live koi [13:16:09] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10MatthewVernon) [13:16:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) [13:16:29] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [13:17:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:17:26] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10MatthewVernon) a:03Papaul [13:19:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:53] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon) [13:21:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) [13:23:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) 
[13:28:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) I'm putting these hosts back into 'insetup' pending hdfs packages on bullseye T310643 [13:29:33] (03CR) 10Samtar: logos/manage.py: Use shortened link in user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246 (owner: 10Samtar) [13:29:34] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [13:29:39] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [13:30:21] (03PS1) 10Andrew Bogott: clouddumps100[12]: move back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/823155 (https://phabricator.wikimedia.org/T302981) [13:32:39] (03CR) 10Andrew Bogott: [C: 03+2] clouddumps100[12]: move back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/823155 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [13:33:07] (03CR) 10JMeybohm: [C: 03+1] Create basic haproxy container (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [13:33:59] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10Papaul) This server is out of warranty and I don't have any disk onsite for replacement [13:34:20] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1070.eqiad.wmnet with OS bullseye [13:34:26] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
ryankemper@cumin1001 for host elastic1070.eqiad.wmnet with OS bullseye [13:37:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Please power down this server so i can disconnect the battery and connect it back . Note server is out of warranty. [13:37:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) Please power down this server so i can disconnect the battery and connect it back . Note server is out of warranty. [13:38:35] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:45] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:32] (03CR) 10David Caro: [C: 03+2] openstack: update control nodes after refresh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822735 (owner: 10David Caro) [13:46:44] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1070.eqiad.wmnet with reason: host reimage [13:47:02] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Check for already published images before pushing [debs/calico] - 10https://gerrit.wikimedia.org/r/654637 (owner: 10JMeybohm) [13:49:24] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1070.eqiad.wmnet with reason: host reimage [13:52:55] (03PS1) 10JMeybohm: Update to v3.20.6 [debs/calico] (v3.20) - 10https://gerrit.wikimedia.org/r/823159 (https://phabricator.wikimedia.org/T307943) [13:53:36] (03Merged) 10jenkins-bot: openstack: update control nodes after refresh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822735 (owner: 10David Caro) [13:55:13] (ThanosCompactIsDown) firing: Thanos 
component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:03:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [14:05:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1070.eqiad.wmnet with OS bullseye [14:05:35] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1070.eqiad.wmnet with OS bullseye completed: - elastic1070 (... [14:05:42] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops, 10Community-Tech (CommTech-Sprint-31): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10JMcLeod_WMF) [14:10:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure [14:10:51] !log hnowlan@deploy1002 Started deploy [restbase/deploy@a571f9a]: Add blwiki T310874 [14:10:53] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure [14:10:54] T310874: Add blkwiki to RESTBase - https://phabricator.wikimedia.org/T310874 [14:10:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d5e8a120-d0d2-4934-9013-0e0723fbb808) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit... 
[14:11:59] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) I've powered this down; if you LMK once it's back up I can then shutdown ms-be2035. [14:19:09] (03PS2) 10Ssingh: dnsrecursor: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:07] (03CR) 10Ssingh: dnsrecursor: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:20:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36738/console" [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:23:36] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:23:59] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1068.eqiad.wmnet with OS bullseye [14:24:16] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS bullseye [14:24:52] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:33] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@a571f9a]: Add blwiki T310874 (duration: 15m 42s) [14:26:38] T310874: Add blkwiki to RESTBase - https://phabricator.wikimedia.org/T310874 [14:34:30] 10SRE, 10Data Engineering Planning, 10Event-Platform Value Stream, 10serviceops, 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) [14:34:41] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:35:48] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::cumin::target: remove port forwarding support [puppet] - 10https://gerrit.wikimedia.org/r/821759 (owner: 10Majavah) [14:36:23] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1068.eqiad.wmnet with reason: host reimage [14:38:42] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [14:39:03] (03CR) 10Andrew Bogott: [C: 03+2] wmcs.quota_increase: fix not needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822736 (owner: 10David Caro) 
[14:39:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1068.eqiad.wmnet with reason: host reimage [14:44:15] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:05] (03Merged) 10jenkins-bot: wmcs.quota_increase: fix not needed parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/822736 (owner: 10David Caro) [14:51:00] (03Abandoned) 10Jforrester: inEventSample: Avoid invalid character warning from sampling code, hash into hex [extensions/WikiEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821745 (https://phabricator.wikimedia.org/T314896) (owner: 10Jforrester) [14:55:37] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 16.92 ms [14:56:25] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:59:41] (03PS3) 10Jbond: O:wikidough: drop wikidough abuse nets [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) [14:59:46] urbanecm: question if you're about ref https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/821249, "extension needs to be present in at least two trains to be addable to extension-list" - *why* does it break scap otherwise? (: [14:59:58] (curious more than anything) [15:01:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1068.eqiad.wmnet with OS bullseye [15:01:15] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS bullseye completed: - elastic1070 (... 
[15:01:23] TheresNoTime: because scap uses the extension list to build i18n cache in both production and labs [15:01:31] and in any stage of the train, we can go one train back [15:01:44] ah so by having two, it'll definitely be there? [15:01:59] (03CR) 10David Caro: [C: 03+2] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [15:02:01] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:27] tbh, I'm not 100% sure if it needs to be "at least two", or "all wmf branches that are present at deployment host". [15:02:40] (03PS2) 10David Caro: pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 [15:03:00] (since train deployment includes purging old wmf branches from deployment host, it doesn't matter much, but still) [15:06:19] TheresNoTime: scap rebuilds i18n cache for all versions mentioned in wikiversions.json. and because we only rollback one train back, it's essentially "latest two trains" [15:06:34] ACKNOWLEDGEMENT - HP RAID on ms-be2032 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315235 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat [15:06:36] 17:01 ah so by having two, it'll definitely be there? <== so, yes. 
[15:06:38] 10SRE, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T315235 (10ops-monitoring-bot) [15:06:40] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) @herron is it okay for me to power this server down so i can reset the IDRAC and upgrade it? [15:06:45] urbanecm: makes sense, thank you :) [15:06:50] any time! [15:07:03] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:51] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:08:31] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10herron) @Papaul yes please proceed [15:12:45] 10SRE, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T315235 (10Papaul) 05Open→03Declined Duplicate of T314427 [15:12:55] (03PS1) 10David Caro: ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168 [15:12:57] (03PS1) 10David Caro: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 [15:16:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10BTullis) @Andrew - I believe that the hadoop-client package and any others on which this work depends have now been packaged and... 
[15:16:40] (03CR) 10Urbanecm: [C: 04-1] logos/manage.py: Use shortened link in user agent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246 (owner: 10Samtar) [15:17:25] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [15:21:25] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) disconnecting the battery didn't fix the issue. So the servers needs to be decom or buy a new battery no need to power down ms-be2035. This server is back online [15:22:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) I need this server depool so i can shut it down to work on this disk issue. [15:22:46] (03CR) 10CI reject: [V: 04-1] ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168 (owner: 10David Caro) [15:24:43] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:49] (03PS1) 10Ahmon Dancy: Add git-review package to profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/823173 [15:25:57] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [15:26:00] (03CR) 10Btullis: [C: 03+2] maintain-views: Add pagetriage-copyvio to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/815215 (https://phabricator.wikimedia.org/T313281) (owner: 10Zabe) [15:26:20] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [15:26:51] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 
(10Papaul) 05Open→03Resolved Complete [15:26:56] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [15:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:27:52] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10MPhamWMF) [15:28:35] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [15:29:00] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) 05Open→03Resolved This is complete Closing it, I will open another task for A1 and A8 once I receive the PDU's [15:29:11] (03PS2) 10Ahmon Dancy: Add git-review package to profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/823173 [15:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T1530). 
[15:31:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2032.codfw.wmnet [15:31:05] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2032.codfw.wmnet [15:31:30] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure [15:31:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure [15:31:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2032.codfw.wmnet [15:31:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2032.codfw.wmnet [15:31:50] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0b0cee87-305c-4cd2-acf0-ac3d3f5b8587) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit... [15:32:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation [15:32:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation [15:32:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=51e5c728-88e2-4e83-acf6-7e651f6e7d29) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit... 
[15:33:31] RECOVERY - Host logstash2003 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [15:34:19] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) @Papaul I've shut ms-be2067 down for you to work on it. [ignore the downtime on ms-be2032 here, that was a typo] [15:35:37] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10MatthewVernon) Ah, OK. These hosts are scheduled for decom (but the cluster needs to be healthy enough for the rebalancing necessary for that to proceed). [15:35:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts logstash2003.codfw.wmnet [15:36:16] (03CR) 10BCornwall: [C: 03+2] admin: Move soworu-01 from ldap-only to analytics [puppet] - 10https://gerrit.wikimedia.org/r/822680 (https://phabricator.wikimedia.org/T313213) (owner: 10BCornwall) [15:36:27] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36739/" [puppet] - 10https://gerrit.wikimedia.org/r/823173 (owner: 10Ahmon Dancy) [15:38:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) [15:38:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) a:05Ottomata→03BCornwall [15:39:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) 05In progress→03Resolved Your permissions have been changed and will go into effect after a short period. I'm closing this ticket now but... 
[15:39:22] (03PS1) 10MVernon: swift: ms-be2028 /dev/sdg1 has failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213) [15:41:11] (03PS2) 10MVernon: swift: ms-be2028 /dev/sdg1 has failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213) [15:41:23] (03CR) 10Ahmon Dancy: [C: 04-1] "Thanks for the review Dzahn. I'll make adjustments to address your comments." [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [15:43:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts logstash2003.codfw.wmnet [15:44:02] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @bcampbell, @TAndic, and I had a meeting to try and suss out how this was working previously. As far as we were able to ascertain it appea... [15:45:01] 10SRE, 10Infrastructure-Foundations, 10netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (10Gehel) We can confirm things are working from the Search Platform point of view. No more work for the Search Platform team, so unassigning... [15:45:07] (03CR) 10Ssingh: [C: 03+1] "Thanks for all the work on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) (owner: 10Jbond) [15:45:09] (03PS2) 10BryanDavis: Add missing attrs dependency [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/822734 [15:48:10] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10MPhamWMF) [15:48:35] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:26] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) 05Open→03Resolved upgrade IDRAC from 3.21.21.21 to 5.10.30.00 @herron server is back up thanks [15:50:40] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10herron) Looks much better thank you! 
[15:53:21] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:47] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:58:13] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [16:01:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:37] 10SRE, 10ops-codfw, 10Gerrit, 10decommission-hardware, and 2 others: decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (10Papaul) [16:02:14] 10SRE, 10ops-codfw, 10Gerrit, 10decommission-hardware, and 2 others: decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (10Papaul) 05Open→03Resolved complete [16:05:52] (03PS1) 10Jbond: P:redis::slave: pass the password [puppet] - 10https://gerrit.wikimedia.org/r/823181 (https://phabricator.wikimedia.org/T228266) [16:05:54] (03PS1) 10Jbond: P:redis::slave: drop use of inline_template [puppet] - 10https://gerrit.wikimedia.org/r/823182 [16:09:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:10:33] 10SRE, 10Traffic-Icebox: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10BCornwall) a:03BCornwall [16:14:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:17:01] !log dancy@deploy1002 Installing scap version "4.13.0" for 553 hosts [16:17:23] !log dancy@deploy1002 Installation of scap version "4.13.0" completed for 553 hosts [16:18:30] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (10Papaul) [16:22:07] (03PS4) 
10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [16:23:23] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1082.eqiad.wmnet with OS bullseye [16:23:31] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1082.eqiad.wmnet with OS bullseye [16:23:48] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [16:24:36] (03CR) 10CI reject: [V: 04-1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [16:24:48] (03CR) 10Dzahn: [C: 03+2] Add git-review package to profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/823173 (owner: 10Ahmon Dancy) [16:25:37] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [16:26:53] (03CR) 10Dzahn: [C: 03+2] "Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[git-review]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/823173 (owner: 10Ahmon Dancy) [16:27:10] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [16:28:10] (03CR) 10Ahmon Dancy: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/823173 (owner: 10Ahmon Dancy) [16:28:25] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:54] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1082.eqiad.wmnet with reason: host reimage [16:36:06] PROBLEM - confd service on sretest1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:37:31] (03CR) 10Andrew Bogott: wmcs: changes to api service to manage toolforge replica.my.cnf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:39:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1082.eqiad.wmnet with reason: host reimage [16:53:29] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:55:17] RECOVERY - confd service on sretest1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:00:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1082.eqiad.wmnet with OS bullseye [17:00:46] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1082.eqiad.wmnet with OS bullseye completed: - elastic1070 (... 
[17:01:59] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:23] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcontrol2003-dev.wikimedia.org - https://phabricator.wikimedia.org/T315089 (10Papaul) 05Open→03Resolved complete [17:09:23] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:09:27] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:09:47] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:17] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:51] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:10:57] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:05] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:11:05] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:05] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:09] hmm [17:11:13] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:43] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:12:09] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:14:03] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:15:15] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:39] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:45] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:47] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:15:51] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:55] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:55] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [17:15:57] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:47] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1052.eqiad.wmnet with OS bullseye [17:17:54] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1052.eqiad.wmnet with OS bullseye [17:18:11] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:15] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:17] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:59] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:19:21] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:20:35] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:21:15] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:21:17] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:21:35] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:22:08] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d4137b5]: increase subgraph query SLA and remove same from drop_old_data [17:22:21] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:23:07] RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:31] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:24:26] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d4137b5]: increase subgraph query SLA and remove same from drop_old_data (duration: 02m 17s) [17:25:11] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:25:21] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:25:25] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:25] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:25] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:27] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:38] 
RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:41] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:28:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2067.codfw.wmnet [17:28:15] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2067.codfw.wmnet [17:28:39] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:28:40] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1052.eqiad.wmnet with reason: host reimage [17:28:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2067.codfw.wmnet [17:29:09] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:29:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [17:30:06] 10SRE, 10ops-codfw: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Papaul) @Marostegui this server is out for warranty and I don't have any 1.9TB SSD disk onsite. 
Thanks [17:32:36] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:32:48] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1052.eqiad.wmnet with reason: host reimage [17:41:01] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) I requested for another disk to be sent to me. The server is back up ` Create Dispatch: Success You have successfully submitted request SR148961821. [17:41:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Give it a minute i am upgrading the BIOS on it [17:45:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:47:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:48:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1052.eqiad.wmnet with OS bullseye [17:48:59] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1052.eqiad.wmnet with OS bullseye completed: - elastic1070 (... [17:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [18:01:24] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10wiki_willy) a:03Cmjohnson [18:04:02] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:04:16] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:32] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:08] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:07:21] !log thanos compact process was hung, forced thanos-compact restart on thanos-fe2001 [18:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:39] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Create the OpenSSH directory inside the rancid home directory [puppet] - 10https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [18:09:45] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:58] (ThanosCompactIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [18:15:08] (03CR) 10Andrea Denisse: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [18:15:22] (03CR) 10Andrea Denisse: [C: 03+1] tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [18:16:51] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1081.eqiad.wmnet with OS bullseye [18:16:57] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1081.eqiad.wmnet with OS bullseye [18:19:54] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [18:20:57] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [18:22:01] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [18:23:06] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [18:23:24] (03PS5) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) [18:24:34] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2067.codfw.wmnet [18:28:40] 10SRE, 10conftool, 10Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jhathaway) In my brief testing it appears that ferm is happy to accept bogus IPs and try to load them in with iptables, leaving the box with no rules at all. @jbond how confident are we that e... [18:28:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [18:29:13] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1081.eqiad.wmnet with reason: host reimage [18:30:04] (03PS1) 10Gergő Tisza: WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) [18:31:33] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ms-be2067.codfw.wmnet [18:33:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1081.eqiad.wmnet with reason: host reimage [18:34:25] (03PS1) 10Andrew Bogott: Revert "clouddumps100[12]: move back to 'insetup'" [puppet] - 10https://gerrit.wikimedia.org/r/823199 (https://phabricator.wikimedia.org/T302981) [18:35:44] (03CR) 10Andrew Bogott: [C: 03+2] Revert "clouddumps100[12]: move back to 'insetup'" [puppet] - 10https://gerrit.wikimedia.org/r/823199 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [18:35:56] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo 
IPv6 is CRITICAL: CRITICAL - failed 93 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:38:01] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@230a820]: include additional deubgging information in HivePartitionRangeSensor logs [18:40:10] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@230a820]: include additional deubgging information in HivePartitionRangeSensor logs (duration: 02m 08s) [18:44:52] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:46:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 86 probes of 682 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:47:57] ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
Brian_King cluster reimage ongoing, this is expected https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:47:58] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:48:18] (03PS1) 10Andrew Bogott: acme_chief: give cloudstore100[12] access to the dumps certs [puppet] - 10https://gerrit.wikimedia.org/r/823200 (https://phabricator.wikimedia.org/T302981) [18:48:27] (03CR) 10CI reject: [V: 04-1] WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza) [18:49:18] (03CR) 10Andrew Bogott: [C: 03+2] acme_chief: give cloudstore100[12] access to the dumps certs [puppet] - 10https://gerrit.wikimedia.org/r/823200 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [18:49:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1081.eqiad.wmnet with OS bullseye [18:49:45] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1081.eqiad.wmnet with OS bullseye completed: - elastic1070 (... 
[18:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32387 and previous config saved to /var/cache/conftool/dbconfig/20220815-185002-ladsgroup.json [18:50:05] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:50:16] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:55:00] (03PS1) 10Andrew Bogott: acme_chief: give clouddumps100[12] access to the dumps certs [puppet] - 10https://gerrit.wikimedia.org/r/823201 (https://phabricator.wikimedia.org/T302981) [18:56:18] (03CR) 10Andrew Bogott: [C: 03+2] acme_chief: give clouddumps100[12] access to the dumps certs [puppet] - 10https://gerrit.wikimedia.org/r/823201 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [19:03:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32388 and previous config saved to /var/cache/conftool/dbconfig/20220815-190508-ladsgroup.json [19:06:16] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:06:49] PROBLEM - NFS on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [19:08:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:12:02] PROBLEM - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [19:14:50] (03CR) 10Gergő Tisza: "recheck: Selenium timeout in Termbox tests, seems unrelated" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza) [19:18:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [19:20:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P32389 and previous config saved to /var/cache/conftool/dbconfig/20220815-192014-ladsgroup.json [19:25:00] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [19:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:34:47] (03PS1) 10Andrew Bogott: Give clouddumps100[12] access to hdfs and rsync things [puppet] - 10https://gerrit.wikimedia.org/r/823208 (https://phabricator.wikimedia.org/T302981) [19:35:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T314041)', diff saved to https://phabricator.wikimedia.org/P32390 and previous config saved to /var/cache/conftool/dbconfig/20220815-193520-ladsgroup.json [19:35:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:35:26] T314041: Drop old 
templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:35:32] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:35:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:35:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32391 and previous config saved to /var/cache/conftool/dbconfig/20220815-193541-ladsgroup.json [19:36:26] (03PS1) 10Ahmon Dancy: Add placeholder for scap's phabricator API token [labs/private] - 10https://gerrit.wikimedia.org/r/823209 (https://phabricator.wikimedia.org/T315255) [19:36:38] 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) a:05Vgutierrez→03BCornwall [19:38:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:38:58] (03CR) 10Andrew Bogott: [C: 03+2] Give clouddumps100[12] access to hdfs and rsync things [puppet] - 10https://gerrit.wikimedia.org/r/823208 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [19:40:18] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Add placeholder for scap's phabricator API token [labs/private] - 10https://gerrit.wikimedia.org/r/823209 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy) [19:40:32] 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team (Kanban): acme-chief shouldn't try 
to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) I believe that https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/820795 will als... [19:40:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:57] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) [19:41:41] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) [19:42:54] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) @SWakiyama as Purity's manager, please approve her to access private data. [19:44:01] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10nshahquinn-wmf) @odimitrijevic or @Ottomata, can you approve Purity for LDAP-only membership in `analytics-privatedata-users`? 
[19:47:28] (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Populate /etc/scap/phabricator_token [puppet] - 10https://gerrit.wikimedia.org/r/823210 (https://phabricator.wikimedia.org/T315255) [19:50:23] (03CR) 10Cwhite: [C: 03+2] logstash: use logstash routing for w3creportingapi stream [puppet] - 10https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [19:51:30] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:52:20] RECOVERY - k8s requests count to the API on ml-serve-ctrl2002 is OK: (C)100 ge (W)50 ge 32.44 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [19:57:37] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10Ottomata) Approved. [19:58:11] (03CR) 10Dzahn: [C: 03+2] profile::mediawiki::deployment::server: Populate /etc/scap/phabricator_token [puppet] - 10https://gerrit.wikimedia.org/r/823210 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy) [19:58:18] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:04] RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T2000). [20:00:04] cjming and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:00:23] i'll deploy o/ [20:00:27] (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Ensure that /etc/scap directory exists [puppet] - 10https://gerrit.wikimedia.org/r/823211 [20:00:36] (03CR) 10Clare Ming: [C: 03+2] Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming) [20:01:15] tgr: do you usually deploy your own patches? happy to do yours if you're around [20:01:15] (03CR) 10CI reject: [V: 04-1] profile::mediawiki::deployment::server: Ensure that /etc/scap directory exists [puppet] - 10https://gerrit.wikimedia.org/r/823211 (owner: 10Ahmon Dancy) [20:01:41] (03Merged) 10jenkins-bot: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming) [20:02:35] cjming: thanks! 
usually one person does all patches (it's slightly faster that way) [20:03:02] (03CR) 10Clare Ming: [C: 03+2] WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza) [20:04:52] tgr: sounds good [20:05:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:05:55] (03PS2) 10Ahmon Dancy: profile::mediawiki::deployment::server: Move /etc/scap/phabricator_token to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) [20:08:48] (03CR) 10CI reject: [V: 04-1] profile::mediawiki::deployment::server: Move /etc/scap/phabricator_token to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy) [20:09:40] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:10:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:10:12] (03PS3) 10Ahmon Dancy: profile::mediawiki::deployment::server: Move stuff to class scap::master [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) [20:11:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36740/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/823211 (https://phabricator.wikimedia.org/T315255) (owner: 10Ahmon Dancy) [20:12:09] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821310|Enable sticky header edit A/B test for 
idwiki + viwiki (T312295)]] (duration: 03m 30s) [20:12:12] T312295: Enable sticky header A/B test for idwiki + viwiki - https://phabricator.wikimedia.org/T312295 [20:15:00] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:21:45] (03Merged) 10jenkins-bot: WelcomeSurvey/VariantHooks: Change hook used for redirection [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza) [20:22:50] tgr: your patch is up on mwdebug1002 - can you test? [20:22:55] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) 05Open→03Resolved This is complete [20:23:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [20:26:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:29] cjming: thanks, works [20:26:36] great - syncing [20:28:22] [5ce9b247-7a74-408b-ae14-c047f2e79585] 2022-08-15 20:28:09: Fatal exception of type "TypeError" [20:28:27] twice, on trying to load a user's contribs [20:28:31] with one good load in between [20:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:28:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:30:18] (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:43] here [20:31:13] I ACKed it for now [20:31:16] but I am still looking [20:31:28] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments: Backport: 
[[gerrit:822485|WelcomeSurvey/VariantHooks: Change hook used for redirection (T313064)]] (duration: 04m 37s) [20:31:31] T313064: WelcomeSurvey: Post-login redirect hooks might interfere with central login during signup - https://phabricator.wikimedia.org/T313064 [20:31:39] tgr: should be live! [20:31:41] Tamzin: I don't see anything like that in the logs [20:31:49] cjming: thanks! [20:31:54] was on trying to load https://en.wikipedia.org/wiki/Special:Contributions/Smoking_Ethel [20:32:02] (ProbeDown) firing: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:07] a couple other people mentioned slow loads / exceptions to me [20:32:23] cjming: where are we in the deploy? [20:32:32] The page loads fine for me. [20:32:40] we are done! just about to close the window [20:32:52] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:32:57] it loads a bit slow for me, and then some scripts/gadgets don't load [20:33:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:33:00] ok so something is definitely up [20:33:02] 
PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:33:03] !log end of UTC late backport window [20:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:30] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [20:33:32] Though indeed surprisingly slow for such a small contributions list. [20:34:55] tgr: sukhe https://phabricator.wikimedia.org/T315260 [20:35:12] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:35:14] actually - is it ok if i deploy one more change? [20:35:18] (ProbeDown) resolved: (8) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:19] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1080.eqiad.wmnet with OS bullseye [20:35:20] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:21] TheresNoTime: thanks, looking [20:35:23] The sync itself could explain the fatals. The Growthexperiments patch was very scap unfriendly and caused ~50.000 errors and exceptions while it was synced [20:35:26] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1080.eqiad.wmnet with OS bullseye [20:35:34] cjming: we just had a deploy for WelcomeSurvey no? [20:35:45] TheresNoTime: yes [20:35:55] TheresNoTime, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/822485/ [20:36:36] (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:54] considering we're getting resolves, maybe it was just the sync zabe ? [20:36:55] TheresNoTime: sorry about that. I probably missed an inter-file dependency in the patch. [20:37:02] It should be transient though. 
[20:37:15] yes [20:37:46] when scaping such changes the files arrive at a random order and for a short period some files are the new version while others are still the old [20:38:46] so for that time period it tried passing a SpecialPageFactory while the old version of WelcomeSurveyHooks.php was still in place which did not accept a SpecialPageFactory resulting in that fatal [20:39:19] we love to see it [20:40:32] I wonder why scap did not abort the sync with such an error rate [20:42:00] RECOVERY - NFS on clouddumps1002 is OK: TCP OK - 0.000 second response time on 208.80.154.71 port 2049 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [20:42:35] RECOVERY - NFS on clouddumps1001 is OK: TCP OK - 0.000 second response time on 208.80.154.142 port 2049 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [20:42:52] or does scap only look at the error rate after the patch was synced to the canaries? [20:43:44] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:44:20] TheresNoTime: i need to revert my config patch from earlier - is it ok to do that now? [20:44:49] cjming: looks good to me, but sukhe is the responding here :) [20:45:16] sukhe: ok if i sneak in a quick revert? [20:45:51] it resolved itself so no issues from SRE's end at least. thanks [20:46:31] (03PS1) 10Clare Ming: Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227 [20:46:32] zabe: WelcomeSurveyHooks has a SpecialPage_initList handler which is called on every request, so the canaries should have errored out. 
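Editorial sketch of the mid-sync failure described at 20:37:46 and 20:38:46: while files arrive one at a time, a caller and a callee can briefly be on different versions, and a constructor signature change then surfaces as a TypeError. This is a minimal simulation in Python, not the GrowthExperiments or scap code; every name below (`OldHooks`, `NewHooks`, `old_wiring`, `new_wiring`, `sync_outcomes`) is a hypothetical stand-in.

```python
# Hypothetical stand-ins for two versions of two deployed files.
# None of these names come from the real GrowthExperiments or scap code.

class OldHooks:
    """Old hook handler: constructor takes no dependencies."""
    def __init__(self):
        pass

class NewHooks:
    """New hook handler: constructor requires a factory argument."""
    def __init__(self, special_page_factory):
        self.special_page_factory = special_page_factory

def old_wiring(hooks_cls):
    # Old caller: builds the handler without arguments.
    return hooks_cls()

def new_wiring(hooks_cls):
    # New caller: passes the (stand-in) factory to the handler.
    return hooks_cls("special-page-factory")

def handle_request(deployed):
    """Build the hook handler from whatever versions are on disk right now."""
    try:
        deployed["wiring"](deployed["hooks"])
        return "ok"
    except TypeError:
        return "fatal"

def sync_outcomes(order):
    """Replace files one at a time in `order`, serving a request after each step."""
    deployed = {"wiring": old_wiring, "hooks": OldHooks}
    new = {"wiring": new_wiring, "hooks": NewHooks}
    outcomes = [handle_request(deployed)]
    for name in order:
        deployed[name] = new[name]
        outcomes.append(handle_request(deployed))
    return outcomes

if __name__ == "__main__":
    print(sync_outcomes(["wiring", "hooks"]))
    print(sync_outcomes(["hooks", "wiring"]))
```

In this sketch the request fails in the mixed state whichever file lands first, and succeeds again once both files are on the new version, which is consistent with the errors being transient.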
[20:47:10] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:48:18] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1080.eqiad.wmnet with reason: host reimage [20:48:34] ok [20:49:32] (03CR) 10Clare Ming: [C: 03+2] Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227 (owner: 10Clare Ming) [20:50:26] (03Merged) 10jenkins-bot: Revert "Enable sticky header edit A/B test for idwiki + viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823227 (owner: 10Clare Ming) [20:50:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1080.eqiad.wmnet with reason: host reimage [20:53:13] (03PS1) 10Andrew Bogott: clouddumps: don't puppetized '/srv/dumps' mount [puppet] - 10https://gerrit.wikimedia.org/r/823217 (https://phabricator.wikimedia.org/T302981) [20:54:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:18] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823227|Revert "Enable sticky header edit A/B test for idwiki + viwiki"]] (duration: 03m 15s) [20:55:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:08] (03CR) 10Andrew Bogott: [C: 03+2] clouddumps: don't puppetized '/srv/dumps' mount [puppet] - 
10https://gerrit.wikimedia.org/r/823217 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [20:59:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:00:04] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220815T2100). [21:00:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:02:42] (03PS1) 10Clare Ming: Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 (https://phabricator.wikimedia.org/T312573) [21:02:51] should this be fully fixed now, or is there residual stuff? i'm still having script issues. namely NavPopups isn't loading [21:04:21] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:04:58] does anyone mind if i deploy 2 more quick changes? or are you all in the middle of something [21:06:18] It should be fully fixed. [21:07:26] hmm. anyone else here use NavPopups and able to reproduce/not? [21:07:29] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:40] Could it be that scap only checks the canaries after the change has been fully synced there? In that case no errors are expected since there is no signature drift. [21:07:48] cjming: we are still in the deploy window, the errors are not happening anymore, should be fine. 
[21:07:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1080.eqiad.wmnet with OS bullseye [21:07:58] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1080.eqiad.wmnet with OS bullseye completed: - elastic1070 (... [21:08:28] tgr: thanks [21:09:03] Tamzin: NavPopups have always been somewhat unreliable for me. Can you expand on "not loading"? What does mw.inspect() say? [21:09:25] When I hover over a link, it's the same behavior as if I didn't have NavPopups enabled [21:09:32] And, 1 sec, lemme see [21:09:45] zabe: I did get scap failure in a similar situation in the past. [21:10:14] (03PS1) 10Clare Ming: Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229 [21:10:32] So I'm pretty sure it can detect mid-sync issues (or could in the past, at least). [21:10:47] Never used mw.inspect before. Which report am I looking at? [21:11:09] I see, so it definetly does not feel like it should have went this way [21:11:21] Tamzin: type it in the browser console and you should get a list of which JS modules are loaded. [21:12:40] (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [21:13:13] uh... 
`3 'ext.gadget.Navigation_popups' '165.0 KiB' 168985` [21:13:46] (03PS3) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) [21:14:44] (03PS4) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) [21:15:29] 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10diego) [21:16:26] Tamzin: I tried enabling navpopups on enwiki, seems to work [21:17:20] hmm. maybe something with my machine [21:20:07] you could try adding debug=1 to the URL, clearing localStorage, clearing cookies (although that might log you out) [21:21:57] (03CR) 10Clare Ming: [C: 03+2] Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 (https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming) [21:29:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1083.eqiad.wmnet with OS bullseye [21:30:00] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1083.eqiad.wmnet with OS bullseye [21:36:57] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:37:10] (03Merged) 10jenkins-bot: Sticky header AB test bucketing for 2 treatment buckets [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/823228 
(https://phabricator.wikimedia.org/T312573) (owner: 10Clare Ming) [21:38:01] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [21:38:23] Hi all! I'm here for the security deployment of T310763, did I miss it or it is still happening? [21:38:50] fyi - i'm still deploying a few changes - should be wrapped up here in a few [21:39:09] awesome :) thanks [21:39:09] 10Puppet, 10Infrastructure-Foundations, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10Jdlrobson) [21:39:38] (03CR) 10Gergő Tisza: "Caused T315260." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822485 (https://phabricator.wikimedia.org/T313064) (owner: 10Gergő Tisza) [21:42:10] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.es6: Backport: [[gerrit:823228|Sticky header AB test bucketing for 2 treatment buckets (T312573)]] (duration: 03m 05s) [21:42:14] T312573: Make sticky header edit button A/B test work for no sticky header control group - https://phabricator.wikimedia.org/T312573 [21:42:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:42:17] (03CR) 10Clare Ming: [C: 03+2] Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229 (owner: 10Clare Ming) [21:42:38] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1083.eqiad.wmnet with reason: host reimage [21:43:03] (03Merged) 10jenkins-bot: Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823229 (owner: 10Clare Ming) 
[21:43:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:43:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:45:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1083.eqiad.wmnet with reason: host reimage [21:47:01] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:49:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:50:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:50:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:51:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:54:34] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823229|Revert "Revert "Enable sticky header edit A/B test for idwiki + viwiki""]] (duration: 03m 37s) [21:54:36] (03PS1) 10Dzahn: phabricator: make it possible to globally enable/disable vcs setup [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) [22:00:08] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [22:01:10] (03PS2) 10Dzahn: phabricator: make it possible to globally enable/disable vcs setup [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) [22:01:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1083.eqiad.wmnet with OS bullseye [22:01:39] 10SRE, 10Discovery-Search (Current work): 
Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1083.eqiad.wmnet with OS bullseye completed: - elastic1070 (... [22:03:40] (03CR) 10Dzahn: [C: 03+2] "most of the diff is unrelated: https://puppet-compiler.wmflabs.org/pcc-worker1003/36743/" [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:05:10] (03PS5) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) [22:07:01] (03CR) 10Dzahn: "noop confirmed / double checked on phab1001/phab2001 - will make https://gerrit.wikimedia.org/r/c/operations/puppet/+/685136 not as danger" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:07:43] (03CR) 10Dzahn: [C: 03+2] "noop confirmed / double checked on phab1001/phab2001 - will make https://gerrit.wikimedia.org/r/c/operations/puppet/+/685136 not as danger" [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:08:01] (03CR) 10Cwhite: [C: 03+2] logstash: update production w3creportingapi guard condition [puppet] - 10https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:09:59] (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/823220 this now comes without the VCS setup" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:15:44] I'm not sure if T310763 happened?! 
😅 [22:23:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [22:24:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:28:48] AnaisGueyte, AFAICS it did not [22:29:18] Thank you! [22:29:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36744/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:30:26] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:33:51] !log rsyncing /srv/repos and /srv/dumps from phab1001 to phab2002 before applying prod puppet role (T313360) [22:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:55] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [22:36:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [22:38:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [22:49:35] (03CR) 10Brennen Bearnes: [C: 03+1] "Belated +1, looks reasonable." 
[puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:49:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [22:52:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [22:53:37] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:59:40] !log search-loader1001 - killed puppet process that had been running since May [22:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:07] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:29] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [23:03:40] (03PS1) 10Clare Ming: Update sticky header config for idwiki, viwiki A/B experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295) [23:06:55] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:08:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:15:59] PROBLEM - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:16:13] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [23:16:37] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:43] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:18:16] ACKNOWLEDGEMENT - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: ssh.service daniel_zahn horrible - still wasnt enough to avoid using the vcs class https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:16] ACKNOWLEDGEMENT - SSH on phab2002 is CRITICAL: connect to address 10.192.32.54 and port 22: Connection refused daniel_zahn horrible - still wasnt enough to avoid using the vcs class https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:18:51] (03CR) 10Dzahn: [C: 03+2] "this is soooo annoying.. 
even after the previous change and reading all the compiler output this STILL caused the exact issues I was worri" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:20:56] !log phab2002 - manually removing service IP addresses for git-ssh.codfw.wikimedia.org which were added by puppet even after gerrit:823220 (!) T280597 [23:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:00] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [23:25:38] (03PS5) 10Cwhite: logstash: add target index validation to tests [puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) [23:26:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:29:16] (03CR) 10Cwhite: [C: 03+2] logstash: add target index validation to tests [puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [23:35:05] (03PS1) 10Dzahn: Revert "site: add phabricator role to phab2002" [puppet] - 10https://gerrit.wikimedia.org/r/823230 [23:35:24] (03CR) 10Dzahn: [C: 03+2] Revert "site: add phabricator role to phab2002" [puppet] - 10https://gerrit.wikimedia.org/r/823230 (owner: 10Dzahn) [23:37:46] (03CR) 10Tim Starling: [C: 03+2] Switch www.mediawiki.org to multi-DC mode [puppet] - 10https://gerrit.wikimedia.org/r/823113 (owner: 10Tim Starling) [23:39:43] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 
13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:40:46] (03CR) 10Dzahn: [C: 03+2] "even after this a new host will STILL assign git-ssh IP addresses (or try to and fail) :( reverting" [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:41:03] RECOVERY - SSH on phab2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:41:57] (03CR) 10Dzahn: [C: 03+2] "a setup where applying a puppet role for a service modifies /etc/ssh/sshd_config is just ..argg" [puppet] - 10https://gerrit.wikimedia.org/r/823220 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:45:30] PROBLEM - PHD should be running on phab2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [23:48:07] PROBLEM - PHD should be supervising processes on phab2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [23:50:07] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:50:47] ^ phab2002 - removed from icinga [23:57:20] (03PS2) 10Tim Starling: Discovery: codfw should be pooled for api-ro and appservers-ro [puppet] - 10https://gerrit.wikimedia.org/r/818652 (https://phabricator.wikimedia.org/T279664)