[00:00:03] <tgr>	 it's only run manually so not risky.
[00:00:31] <wikibugs>	 (03PS1) 10Gergő Tisza: fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531)
[00:00:50] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:53] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531) (owner: 10Gergő Tisza)
[00:02:13] <wikibugs>	 (03CR) 10Legoktm: "Untested" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm)
[00:05:48] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:36] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:34] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:29] <legoktm>	 fixed deploy_to_mwdebug
[00:13:41] <legoktm>	 well, reset it
[00:14:18] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:04] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) (owner: 10Urbanecm)
[00:18:31] <wikibugs>	 (03Merged) 10jenkins-bot: fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531) (owner: 10Gergő Tisza)
[00:20:17] <wikibugs>	 (03PS2) 10Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - 10https://gerrit.wikimedia.org/r/716635
[00:21:53] <wikibugs>	 (03PS1) 10Cwhite: logstash: temporariliy reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716637
[00:23:34] <wikibugs>	 (03PS2) 10Cwhite: logstash: temporariliy reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716637
[00:23:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:17] <wikibugs>	 (03PS3) 10Cwhite: logstash: temporarily reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716637
[00:25:15] <wikibugs>	 (03PS3) 10Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - 10https://gerrit.wikimedia.org/r/716635
[00:25:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:09] <logmsgbot>	 !log tgr@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Backport: [[gerrit:716491|fixLinkRecommendationData: Try harder to avoid >10K result sets (T284531)]] (duration: 00m 58s)
[00:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:15] <stashbot>	 T284531: Add Link: Work around 10K search result set limit in fixLinkRecommendationData.php - https://phabricator.wikimedia.org/T284531
[00:50:56] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: temporarily reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716637 (owner: 10Cwhite)
[01:00:57] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: nest_root_fields exclude tags and severity [puppet] - 10https://gerrit.wikimedia.org/r/716635 (owner: 10Cwhite)
[01:01:07] <wikibugs>	 (03PS4) 10Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - 10https://gerrit.wikimedia.org/r/716635
[01:44:06] <wikibugs>	 (03Abandoned) 10Juan90264: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264)
[02:20:25] <wikibugs>	 (03PS2) 10Krinkle: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[02:20:28] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[02:20:59] <wikibugs>	 (03PS3) 10Krinkle: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[02:21:15] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke)
[02:27:30] <wikibugs>	 (03PS1) 10Gergő Tisza: Run GrowthExperiments fixLinkRecommendationData --dry-run every day [puppet] - 10https://gerrit.wikimedia.org/r/716755 (https://phabricator.wikimedia.org/T283868)
[02:32:58] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:05:34] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:06:14] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:12] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:26:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:24] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 14.08 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:01:58] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 11.09 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:04:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2138 for upgrade', diff saved to https://phabricator.wikimedia.org/P17192 and previous config saved to /var/cache/conftool/dbconfig/20210903-050423-marostegui.json
[05:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17193 and previous config saved to /var/cache/conftool/dbconfig/20210903-051124-root.json
[05:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:29] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:11:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17194 and previous config saved to /var/cache/conftool/dbconfig/20210903-051149-root.json
[05:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:10] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove pc2007 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/716915 (https://phabricator.wikimedia.org/T289112)
[05:20:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2007.codfw.wmnet
[05:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Remove pc2007 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/716915 (https://phabricator.wikimedia.org/T289112) (owner: 10Marostegui)
[05:23:12] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.51 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:26:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17195 and previous config saved to /var/cache/conftool/dbconfig/20210903-052628-root.json
[05:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:33] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:26:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17196 and previous config saved to /var/cache/conftool/dbconfig/20210903-052653-root.json
[05:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:30] <logmsgbot>	 !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts pc2007.codfw.wmnet
[05:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:39] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update pcX-master [dns] - 10https://gerrit.wikimedia.org/r/716936 (https://phabricator.wikimedia.org/T284825)
[05:36:42] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 13.94 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:41:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17198 and previous config saved to /var/cache/conftool/dbconfig/20210903-054131-root.json
[05:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:37] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:41:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17199 and previous config saved to /var/cache/conftool/dbconfig/20210903-054157-root.json
[05:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17200 and previous config saved to /var/cache/conftool/dbconfig/20210903-055635-root.json
[05:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:40] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:57:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17201 and previous config saved to /var/cache/conftool/dbconfig/20210903-055700-root.json
[05:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:09] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db2110 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650)
[06:05:13] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.86 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:11:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17202 and previous config saved to /var/cache/conftool/dbconfig/20210903-061138-root.json
[06:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:44] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:12:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17203 and previous config saved to /var/cache/conftool/dbconfig/20210903-061204-root.json
[06:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:43] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:35] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:21:21] <wikibugs>	 (CR) Elukey: Introduce the secrets helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (owner: Elukey)
[06:27:14] <wikibugs>	 (03CR) 10Elukey: "Hello Folks! Thanks a lot for the follow up, I completely understand why you are forking everything but it seems not super DRY. I didn't m" [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro)
[06:33:29] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:51] <wikibugs>	 (03PS9) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835)
[06:38:43] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:20] <icinga-wm>	 RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 2.426 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:40:15] <wikibugs>	 (03CR) 10Elukey: "@JMeybohm: should be ready for another pass, lemme know if I forgot anything!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[06:43:50] <icinga-wm>	 ACKNOWLEDGEMENT - Host mw2264 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T290242
[06:45:25] <elukey>	 !log run `apt-get clean` on cp5012 to free some space (94% of the root partition used)
[06:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:13] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:43] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Update pcX-master [dns] - 10https://gerrit.wikimedia.org/r/716936 (https://phabricator.wikimedia.org/T284825) (owner: 10Marostegui)
[06:57:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210903T0700)
[07:00:08] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Remove config that moved on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295)
[07:01:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:58] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui)
[07:05:02] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) This is ready for #dc-ops
[07:05:23] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) a:Marostegui→wiki_willy
[07:05:58] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) a:wiki_willy→Papaul
[07:10:57] <godog>	 !log more weight to ms-be20[62-65] - T288458
[07:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:01] <stashbot>	 T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[07:16:09] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (fgiunchedi) p:Triage→Medium
[07:17:14] <jynus>	 I have increased the eqiad backup speed to 14 threads, as Fridays tend to be of lower load
[07:17:47] <jynus>	 there was some weird load graph from 0h to 7h, but that was me, it was mainly writes
[07:19:31] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:20:52] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Switchover db2090 with db2110 [dns] - 10https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650)
[07:21:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update repository hook for Gitlab 14 [puppet] - 10https://gerrit.wikimedia.org/r/716346 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff)
[07:21:17] <wikibugs>	 SRE, wikitech.wikimedia.org, Sustainability (Incident Followup): Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (LSobanski) p:Triage→Medium a:LSobanski
[07:23:08] <wikibugs>	 (03PS1) 10JMeybohm: mediawiki-dev: Run setup-db as helm hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066
[07:23:18] <wikibugs>	 (03PS1) 10Zabe: Typo fix: 'the the' -> 'the' [deployment-charts] - 10https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491)
[07:26:52] <wikibugs>	 (03Abandoned) 10Gehel: Make retries less verbose. [software/spicerack] - 10https://gerrit.wikimedia.org/r/491531 (owner: 10Gehel)
[07:26:55] <wikibugs>	 (03Abandoned) 10Gehel: [WIP] log relocating shards during cluster restart [software/spicerack] - 10https://gerrit.wikimedia.org/r/492307 (owner: 10Gehel)
[07:30:12] <wikibugs>	 (03PS1) 10Elukey: knative-serving: add serving chart to helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:33:51] <wikibugs>	 (03PS2) 10Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:38:49] <wikibugs>	 (CR) JMeybohm: [C: +1] Introduce the secrets helm chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[07:42:54] <marostegui>	 !log Remove flaggedrevs_stats2 and flaggedrevs_stats from several s3 wikis - T289050
[07:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:58] <stashbot>	 T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[07:43:15] <legoktm>	 !log uploaded pygments 2.10.0+dfsg-1~wmf1 to apt.wm.o in component/pygments
[07:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:22] <wikibugs>	 (03PS3) 10Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:49:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Increased threads to 14 -7 on each worker. We now get a backup speed of over 44 files/s, which would imply a pending c...
[07:53:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100
[07:55:34] <godog>	 up for volunteers for review ^
[07:55:51] <elukey>	 checking
[07:57:49] <godog>	 cheers elukey 
[07:59:37] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-fgiunchedi: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (10fgiunchedi)
[08:01:17] <wikibugs>	 (CR) Filippo Giunchedi: POC sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[08:04:08] <wikibugs>	 (03PS2) 10Muehlenhoff: Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721
[08:05:53] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mathoid: Bump deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870)
[08:12:40] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff)
[08:12:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "mathoid: Pin the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716361 (owner: 10Alexandros Kosiaris)
[08:13:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: Bump deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870) (owner: 10Alexandros Kosiaris)
[08:15:02] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631273 (owner: 10PipelineBot)
[08:15:09] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640516 (owner: 10PipelineBot)
[08:15:15] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/640523 (owner: 10PipelineBot)
[08:15:20] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/651764 (owner: 10PipelineBot)
[08:15:25] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/653264 (owner: 10PipelineBot)
[08:15:32] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/653525 (owner: 10PipelineBot)
[08:15:36] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/654282 (owner: 10PipelineBot)
[08:15:41] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/654312 (owner: 10PipelineBot)
[08:15:45] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/654316 (owner: 10PipelineBot)
[08:15:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mathoid: Pin the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716361 (owner: 10Alexandros Kosiaris)
[08:15:50] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/677846 (owner: 10PipelineBot)
[08:15:54] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/677942 (owner: 10PipelineBot)
[08:15:59] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704136 (owner: 10PipelineBot)
[08:16:05] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704141 (owner: 10PipelineBot)
[08:16:10] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704140 (owner: 10PipelineBot)
[08:16:16] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704145 (owner: 10PipelineBot)
[08:16:20] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/704144 (owner: 10PipelineBot)
[08:16:22] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: Bump deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870) (owner: 10Alexandros Kosiaris)
[08:17:27] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715933 (owner: 10PipelineBot)
[08:19:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mathoid: Fix typo in version deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/717125
[08:19:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] mathoid: Fix typo in version deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/717125 (owner: 10Alexandros Kosiaris)
[08:23:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[08:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:03] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 87 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:35:40] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] prometheus: couple mysqld export service to mariadb (multi-instance) [puppet] - 10https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[08:35:49] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:37:44] <wikibugs>	 (03PS4) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305)
[08:41:03] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[08:43:41] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 68 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:43:41] <wikibugs>	 (03Merged) 10jenkins-bot: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[08:45:19] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1022.eqiad.wmnet
[08:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:45] <ema>	 !log cp-eqsin: clean apt cache to free up some space T290305
[08:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:49] <stashbot>	 T290305: Low root disk space on multiple eqsin cp nodes - https://phabricator.wikimedia.org/T290305
[08:47:26] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) Thanks a lot for the detailed update @jbond!
[08:49:35] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:52:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[08:52:09] <wikibugs>	 (03PS4) 10Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[08:52:20] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:55] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:44] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline" [debs/python-eventlet] (debian/bullseye) - 10https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: 10Filippo Giunchedi)
[09:00:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond)
[09:00:52] <wikibugs>	 (03CR) 10Muehlenhoff: Debian: Add support for bookworm as a valid codename (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713615 (owner: 10MVernon)
[09:01:56] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 72 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:02:10] <wikibugs>	 (03PS1) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249)
[09:03:00] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:47] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[09:06:01] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[09:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:18] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 48 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:07:51] <vgutierrez>	 pylint going anal again, sigh
[09:08:36] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:22] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979]
[09:09:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:40] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:51] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[09:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:28] <wikibugs>	 (03Abandoned) 10MVernon: Debian: Add support for bookworm as a valid codename [puppet] - 10https://gerrit.wikimedia.org/r/713615 (owner: 10MVernon)
[09:13:24] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:24] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1022.eqiad.wmnet
[09:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:31] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1022.eqiad.wmnet` - mc1022.eqiad.wmnet (**...
[09:15:04] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:11] <wikibugs>	 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10akosiaris)
[09:18:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:19:02] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:19:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:20:06] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:22:16] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 84 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:22:44] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki)
[09:25:02] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1025-1026].eqiad.wmnet
[09:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:11] <elukey>	 jelto: o/ are the above latency increases related to the helm deployment? 
[09:26:31] <jelto>	 elukey: I'm looking already but I could not find anything yet. I deployed some RBAC changes (cluster roles and rolebindings). Still looking
[09:26:58] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979] (duration: 17m 36s)
[09:27:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:12] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 57 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:28:32] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:28:50] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:29:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:29:38] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:29:47] <elukey>	 jelto: ack, lemme know if you need help! IIRC we had a similar thing yesterday or a couple of days ago, and it all self-recovered
[09:30:21] <wikibugs>	 (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch)
[09:31:37] <wikibugs>	 (03PS3) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950
[09:31:40] <wikibugs>	 (03PS4) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771)
[09:31:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond)
[09:32:10] <wikibugs>	 (03CR) 10Jgiannelos: Configure event stream for map tile state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos)
[09:32:35] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979]
[09:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:43] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979] (duration: 00m 07s)
[09:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:55] <wikibugs>	 (03CR) 10Jgiannelos: "Besides enabling canary events, I also adapted the patch based on https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/716219" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos)
[09:33:29] <jelto>	 elukey: thanks! My guess would be that helmfile apply raised the latency in the eqiad clusters because it made quite a lot of changes... but the spike is also quite delayed relative to the actual deploy. Now all latencies are back to normal
[09:34:35] <elukey>	 yes, that could be what happened; among the metrics I saw a rise in "events" when latency went up
[09:34:52] <elukey>	 anyway, if it is a temp spike for RBAC it is ok
[09:34:56] <elukey>	 let's keep it monitored
[09:37:44] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 78 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:40:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs20
[09:40:58] <icinga-wm>	 .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:41:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are mar
[09:41:06] <icinga-wm>	  but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:41:18] <elukey>	 dcausse, gehel ---^
[09:41:19] <dcausse>	 ?
[09:41:25] <elukey>	 hello :)
[09:41:35] <dcausse>	 :)
[09:41:58] <elukey>	 it seems that some wdqs servers are overloaded (not responding to health checks)
[09:42:08] <gehel>	 Oops
[09:43:31] <gehel>	 internal cluster also has higher than usual load
[09:44:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Going to merge and test it, lemme know if anything looks off!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[09:44:07] <gehel>	 looks like an increase in imported triples as well
[09:44:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:47:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:34] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 38 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:55:01] <wikibugs>	 (03PS2) 10Filippo Giunchedi: clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100
[09:56:12] <wikibugs>	 (03PS2) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249)
[09:56:14] <wikibugs>	 (03PS1) 10Vgutierrez: embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227
[09:57:44] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[09:57:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:53] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 09s)
[09:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:22] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[09:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:16] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) So far the [[ https://grafana.wikimedia.org/d/000000477/puppetdb?viewPanel=7&orgId=1&from=now-7d&to=now | graph ]] is looking much health...
[10:00:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227 (owner: 10Vgutierrez)
[10:00:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[10:00:15] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 53s)
[10:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:18] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:01:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:43] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 25s)
[10:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:34] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:10] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s)
[10:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:02] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:48] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 45s)
[10:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:57] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:10:04] <wikibugs>	 (03PS2) 10Vgutierrez: embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227
[10:10:06] <wikibugs>	 (03PS3) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249)
[10:12:50] <wikibugs>	 (03CR) 10Kosta Harlan: "Looks correct. Is there any reason to gradually remove this over the course of a week? I suppose it should be OK to remove en masse but we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm)
[10:13:59] <volans>	 effie: the alert above seems related to your changes, is it possible it is waiting for user input?
[10:16:37] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:16:41] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:18] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s)
[10:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) >>! In T289606#7329757, @EYener wrote: > Hi all! Thank you for working on this and granting access for @JMando! We have been working from the level of access authorized, which you'v...
[10:21:16] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:11] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 55s)
[10:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:01] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:23:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) p:05Triage→03Medium
[10:24:02] <wikibugs>	 (03PS12) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498)
[10:24:04] <wikibugs>	 (03PS17) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[10:24:06] <wikibugs>	 (03PS19) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498)
[10:24:08] <wikibugs>	 (03PS7) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[10:24:10] <wikibugs>	 (03PS1) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:24:38] <wikibugs>	 (03PS13) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498)
[10:24:46] <wikibugs>	 (03PS18) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[10:24:57] <wikibugs>	 (03PS20) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498)
[10:25:06] <wikibugs>	 (03PS8) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[10:25:13] <wikibugs>	 (03PS2) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:25:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[10:25:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:26:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[10:28:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[10:29:28] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
[10:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:31] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 03s)
[10:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:30:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) FTR I emailed the user and contact via email to confirm
[10:31:25] <wikibugs>	 (03PS3) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:31:45] <wikibugs>	 (03PS4) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:33:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30997/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:34:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:38:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Handle non dict YAML documents as well [software/service-checker] - 10https://gerrit.wikimedia.org/r/717249
[10:38:26] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (10fgiunchedi) >>! In T290223#7329001, @Legoktm wrote: > Thanks for taking a look :) I don't really understand why spamassasin added the X-Spam-Report header in th...
[10:39:43] <wikibugs>	 (03PS1) 10Volans: ipmi: refactor class signature [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250
[10:39:45] <wikibugs>	 (03PS1) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251
[10:40:13] <wikibugs>	 (03PS9) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[10:41:28] <wikibugs>	 (03PS5) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:41:39] <wikibugs>	 (03PS6) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:42:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[10:42:45] <wikibugs>	 (03PS1) 10Muehlenhoff: labs_bootstrapvz: Install emacs-nox instead of emacs [puppet] - 10https://gerrit.wikimedia.org/r/717252
[10:44:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:45:25] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-test): Deploy latest code on AQS new servers - test after failures
[10:45:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans)
[10:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:30] <logmsgbot>	 !log joal@deploy1002 deploy aborted: Deploy latest code on AQS new servers - test after failures (duration: 00m 05s)
[10:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:46:28] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures
[10:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:01] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures (duration: 00m 32s)
[10:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:09] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:48:14] <wikibugs>	 (03PS2) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251
[10:49:53] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:51:47] <wikibugs>	 (03PS7) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:53:03] <wikibugs>	 (03PS8) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[10:53:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31002/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:54:13] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.038 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:54:57] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc[1025-1026].eqiad.wmnet
[10:54:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:03] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1025-1026].eqiad.wmnet` - mc1025.eqiad.wm...
[10:55:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:55:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans)
[10:56:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC still running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31003" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[10:56:41] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:58:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1028-1032].eqiad.wmnet
[10:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:29] <wikibugs>	 (PS3) Volans: ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251
[11:00:46] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (dcaro)
[11:03:45] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM and a nice improvement thanks" [software/spicerack] - https://gerrit.wikimedia.org/r/717250 (owner: Volans)
[11:04:18] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.041 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:05:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[11:10:09] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 1.128 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:13:59] <wikibugs>	 (CR) Volans: "Great, black seems to not be pep8-compliant here..." [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[11:14:30] <wikibugs>	 (CR) Jbond: "another nit and a typo" [cookbooks] - https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[11:15:55] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (dcaro)
[11:17:23] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:19:43] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:24:02] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) (owner: MVernon)
[11:27:01] <wikibugs>	 (CR) Jbond: "thanks for the updates will look to roll this (and the next one) out Monday" [puppet] - https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: Jbond)
[11:27:43] <wikibugs>	 (PS5) Jbond: facter networking: override the networking.ip6 fact [puppet] - https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904)
[11:28:25] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (Dzahn) Open→Resolved a:Dzahn This should be resolved now.
[11:29:04] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[11:29:44] <wikibugs>	 (PS6) Jbond: facter networking: override the networking.ip6 fact [puppet] - https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904)
[11:30:47] <wikibugs>	 (CR) Dzahn: [C: +1] Add emacs-nox to standard packages [puppet] - https://gerrit.wikimedia.org/r/377721 (owner: Muehlenhoff)
[11:30:49] <wikibugs>	 (CR) Jbond: "updated thanks all will look to merge on Monday.  once merged the facts should disappear from puppetdb after the next puppet run." [puppet] - https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) (owner: Jbond)
[11:35:25] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA
[11:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:33] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 01m 07s)
[11:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:06] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA
[11:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:52] <marostegui>	 !log Remove flaggedrevs_stats2 and flaggedrevs_stats from enwiki - T289050
[11:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:57] <stashbot>	 T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[11:43:24] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:43:28] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:43:46] <majavah>	 dcausse: ^ related?
[11:43:51] <dcausse>	 majavah: yes
[11:44:09] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d]
[11:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:23] <dcausse>	 wdqs in codfw is currently mostly down
[11:45:02] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:45:06] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:45:54] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:46:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:47:38] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:48:34] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:48:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:49:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:49:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:50:50] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:52:14] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (JMeybohm) Open→Resolved
[11:52:17] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[11:52:34] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm) Open→Resolved
[11:54:30] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:54:53] <wikibugs>	 (PS10) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[11:56:20] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[11:56:27] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 19m 21s)
[11:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[11:57:14] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:59:20] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:59:56] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[12:00:06] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:00:56] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:02:21] <wikibugs>	 (CR) Jbond: O:base::resolving: drop the domain keyword and use the domain fact (1 comment) [puppet] - https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[12:02:38] <wikibugs>	 (PS1) Ema: rsyslog: stop saving trafficserver-tls logs to disk [puppet] - https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305)
[12:02:41] <wikibugs>	 (PS1) Muehlenhoff: Designate new mc canary [puppet] - https://gerrit.wikimedia.org/r/717312
[12:03:08] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[12:03:25] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d] (duration: 19m 16s)
[12:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:50] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d]
[12:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:03:57] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d] (duration: 00m 06s)
[12:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:05:38] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[12:06:19] <wikibugs>	 (CR) Jbond: O:base::resolving: make nameservers mandatory (2 comments) [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[12:08:10] <wikibugs>	 (CR) Michael DiPietro: [C: +2] update celery worker to allow for celery v5 [puppet] - https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: Michael DiPietro)
[12:08:44] <wikibugs>	 (CR) Jbond: [C: -1] O:base::resolving: make nameservers mandatory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[12:12:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1028-1032].eqiad.wmnet
[12:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:00] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1028-1032].eqiad.wmnet` - mc1028.eqiad.wm...
[12:13:56] <wikibugs>	 (CR) Jbond: [C: -1] O:base::resolving: make nameservers mandatory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[12:23:14] <wikibugs>	 (CR) Dzahn: [C: +1] query_service: remove absented query-service-gc-log-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[12:26:50] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:30:18] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) >>! In T251305#7319375, @elukey wrote: > Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call `...
[12:32:23] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1035-1036].eqiad.wmnet
[12:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:15] <wikibugs>	 (PS1) Dzahn: add HTML of the first 10000 bugs [container/miscweb] - https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538)
[12:33:17] <wikibugs>	 (CR) Majavah: [C: +2] d/changelog: prepare 0.23 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/716221 (owner: Majavah)
[12:34:33] <wikibugs>	 (Merged) jenkins-bot: d/changelog: prepare 0.23 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/716221 (owner: Majavah)
[12:35:40] <wikibugs>	 (CR) Effie Mouzeli: "I had this in Iba888921391ec33e0bd7caadf435ae453c34ae5f, I wanted to decom all hosts and update it, but this will do" [puppet] - https://gerrit.wikimedia.org/r/717312 (owner: Muehlenhoff)
[12:35:43] <wikibugs>	 (PS2) Dzahn: add HTML of the first 10000 bugs [container/miscweb] - https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538)
[12:35:59] <wikibugs>	 (PS2) Effie Mouzeli: Designate new mc canary [puppet] - https://gerrit.wikimedia.org/r/717312 (owner: Muehlenhoff)
[12:41:28] <wikibugs>	 (CR) Dzahn: [C: +2] add HTML of the first 10000 bugs [container/miscweb] - https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[12:43:29] <wikibugs>	 (Merged) jenkins-bot: add HTML of the first 10000 bugs [container/miscweb] - https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[12:45:22] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Designate new mc canary [puppet] - https://gerrit.wikimedia.org/r/717312 (owner: Muehlenhoff)
[12:46:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1035-1036].eqiad.wmnet
[12:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:29] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1035-1036].eqiad.wmnet` - mc1035.eqiad.wm...
[12:47:15] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (jijiki)
[12:48:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1023.eqiad.wmnet
[12:48:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1023.eqiad.wmnet
[13:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:41] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1023.eqiad.wmnet` - mc1023.eqiad.wmnet (**...
[13:04:07] <wikibugs>	 (CR) Ema: "I've discussed this on IRC with Filippo and he pointed out the alternative approach of using the "stop" statement in 20-trafficserver.conf" [puppet] - https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: Ema)
[13:04:28] <wikibugs>	 (CR) Jgiannelos: Configure event stream for map tile state change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:04:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs20
[13:04:40] <icinga-wm>	 .wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:04:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet
[13:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:43] <wikibugs>	 (PS14) Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498)
[13:07:45] <wikibugs>	 (PS4) Volans: ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251
[13:07:47] <wikibugs>	 (PS1) Volans: setup.py: revert upper limit for regex [software/spicerack] - https://gerrit.wikimedia.org/r/717383
[13:09:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:10:59] <dcausse>	 !log installing openjdk-8-dbg on wdqs2007 
[13:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:02] <icinga-wm>	 PROBLEM - DNS on mc1026.mgmt is CRITICAL: Domain mc1026.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:11:43] <effie>	 ^ me
[13:11:46] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet
[13:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:52] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1027.eqiad.wmnet` - mc1027.eqiad.wmnet (**...
[13:13:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[13:14:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs20
[13:14:12] <icinga-wm>	 .wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:15:34] <wikibugs>	 (CR) Volans: "This passes locally on 3.9, I'll check with all the combinations later" [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[13:16:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:16:59] <wikibugs>	 (PS3) Effie Mouzeli: Decommission old eqiad memcached hosts [puppet] - https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657)
[13:17:56] <wikibugs>	 (CR) Filippo Giunchedi: "Extra comments LGTM" [puppet] - https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: Ema)
[13:18:11] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (jijiki)
[13:19:06] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:20:06] <wikibugs>	 (CR) Elukey: "This was of course wrong, I have to study a bit more helmfile :)" [deployment-charts] - https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[13:20:11] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (jijiki)
[13:20:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (EYener) Thanks! Error for me when I `ssh -v stat1007.eqiad.wmnet`:  `eyener@wmf2395 ~ % ssh stat1007 -v eqiad.wmnet OpenSSH_8.1p1, LibreSSL 2.7.3 debug1: Reading configuration data /Users/eyene...
[13:20:56] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (jijiki) a:jijiki→Cmjohnson
[13:24:19] <wikibugs>	 (PS19) Jbond: O:base::resolving: make nameservers mandatory [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[13:30:08] <wikibugs>	 (PS1) JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - https://gerrit.wikimedia.org/r/717412
[13:31:01] <wikibugs>	 (PS2) JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - https://gerrit.wikimedia.org/r/717412
[13:32:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (Ottomata) @EYener  you typed  `ssh stat1007 -v eqiad.wmnet`, try `ssh -v stat1007.eqiad.wmnet` :)
[13:37:12] <wikibugs>	 (PS3) JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - https://gerrit.wikimedia.org/r/717412
[13:37:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (EYener) Oh yes. That will help. :) Connected, thank you!
[13:37:58] <wikibugs>	 (CR) Ottomata: Configure event stream for map tile state change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:40:40] <wikibugs>	 (CR) Ottomata: Configure event stream for map tile state change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:45:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] labs_bootstrapvz: Install emacs-nox instead of emacs [puppet] - https://gerrit.wikimedia.org/r/717252 (owner: Muehlenhoff)
[13:49:49] <wikibugs>	 (PS9) Jbond: base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[13:50:54] <wikibugs>	 (CR) Jbond: base::resolving: convert base::resolving to a profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[13:51:45] <wikibugs>	 (CR) Ottomata: "BTW, you gave me a reason to write:" [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:52:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[13:53:52] <wikibugs>	 (PS1) Majavah: Swap emacs with emacs-nox [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/717422
[13:54:47] <wikibugs>	 (CR) Andrew Bogott: [C: +1] Swap emacs with emacs-nox [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/717422 (owner: Majavah)
[13:59:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:15] <wikibugs>	 (CR) Jbond: O:base::resolver: unify resolv.conf templates (3 comments) [puppet] - https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[14:05:11] <wikibugs>	 (PS4) Effie Mouzeli: Decommission old eqiad memcached hosts [puppet] - https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657)
[14:11:50] <icinga-wm>	 PROBLEM - Host mw2264.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:06] <wikibugs>	 (PS1) Elukey: knative-serving: fix the istio_secrets template [deployment-charts] - https://gerrit.wikimedia.org/r/717435
[14:12:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (fgiunchedi) Fantastic, thank you @EYener and @Ottomata ! Awaiting confirmation of access from @JMando
[14:12:22] <icinga-wm>	 RECOVERY - Host mw2264.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[14:12:26] <wikibugs>	 (PS15) Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498)
[14:12:28] <wikibugs>	 (PS20) Jbond: O:base::resolving: make nameservers mandatory [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[14:12:30] <wikibugs>	 (PS21) Jbond: O:base::resolver: unify resolv.conf templates [puppet] - https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498)
[14:12:32] <wikibugs>	 (PS11) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[14:12:34] <wikibugs>	 (PS10) Jbond: base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[14:13:36] <icinga-wm>	 RECOVERY - Host mw2264 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms
[14:13:46] <icinga-wm>	 PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:14:18] <wikibugs>	 (CR) Andrew Bogott: O:base::resolving: drop the domain keyword and use the domain fact (1 comment) [puppet] - https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[14:15:02] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2264 is CRITICAL: CRITICAL: 320 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[14:16:14] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: fix the istio_secrets template [deployment-charts] - https://gerrit.wikimedia.org/r/717435 (owner: Elukey)
[14:16:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[14:17:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[14:17:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:18:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:18:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:34] <icinga-wm>	 RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:19:48] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:20:13] <mutante>	 !log mw2264 - scap pull
[14:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:38] <papaul>	 mutante: thanks
[14:20:52] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2264 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[14:21:04] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 7.285 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:23:01] <mutante>	 papaul: thanks as well
[14:23:03] <jbond>	 hi all, just a heads up: I plan to disable puppet at 15:00 to kick off the puppetdb maintenance work
[14:23:25] <mutante>	 thanks jbond 
[14:23:33] <wikibugs>	 (03PS21) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[14:23:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 (owner: 10Volans)
[14:25:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10dcaro)
[14:25:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm (of course when tests are fixed)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans)
[14:26:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:27:34] <papaul>	 yes 
[14:29:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[14:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:31] <wikibugs>	 (03PS2) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352)
[14:32:37] <wikibugs>	 (03PS22) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498)
[14:32:39] <wikibugs>	 (03PS12) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[14:32:41] <wikibugs>	 (03PS11) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[14:35:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[14:36:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) I am in now. I can access those UI's and successfully ssh into stat1007. Thank you!
[14:36:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[14:36:22] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff)
[14:37:36] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Papaul) @Dzahn first let us swap A1 with B1 and see if we still have the error on A1. Memory swap complete and IDRAC upgrade from 2.50 to 2.80. I will leave the task open for now until next week.  thanks
[14:38:18] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (10Papaul)
[14:38:28] <wikibugs>	 (03PS13) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[14:38:59] <wikibugs>	 (03PS3) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352)
[14:39:30] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2264 is CRITICAL: Host mw2264 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:39:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) 05Open→03Resolved
[14:40:23] <wikibugs>	 (03Abandoned) 10Hashar: allow useful Jenkins URLs [puppet] - 10https://gerrit.wikimedia.org/r/629417 (https://phabricator.wikimedia.org/T178458) (owner: 10CDanis)
[14:40:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond)
[14:41:15] <wikibugs>	 (03Abandoned) 10Hashar: gerrit: Add option to enable developer auth [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox)
[14:41:17] <wikibugs>	 (03PS12) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[14:43:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[14:44:04] <wikibugs>	 (03PS4) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412
[14:46:16] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:46:55] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Waiting for Jeena's approval as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm)
[14:48:16] <wikibugs>	 (03PS1) 10Urbanecm: foundationwiki: Restrict editing of sensitive namespaces to `editor` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350)
[14:48:53] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "do not merge (yet), pending adding members to the wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm)
[14:50:26] <wikibugs>	 (03PS3) 10BryanDavis: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm)
[14:51:53] <wikibugs>	 (03PS1) 10Elukey: knative-serving: improve the helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835)
[14:52:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "no-op for prod, docs only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491) (owner: 10Zabe)
[14:52:48] <wikibugs>	 (03PS13) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[14:53:38] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:53:52] <wikibugs>	 (03CR) 10Elukey: "This seems to work on deploy1002 (had to test in there because I wasn't sure what worked and what not)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[14:54:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond)
[14:55:21] <wikibugs>	 (03CR) 10Urbanecm: Growth: Remove config that moved on-wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm)
[14:55:38] <wikibugs>	 (03Merged) 10jenkins-bot: Typo fix: 'the the' -> 'the' [deployment-charts] - 10https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491) (owner: 10Zabe)
[14:56:22] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 7.352 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[14:56:22] <wikibugs>	 10ops-codfw, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul)
[14:56:38] <wikibugs>	 10ops-codfw, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) p:05Triage→03Medium
[14:56:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "As always, going to merge and test this properly. Please lemme know anything weird :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[14:58:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:58:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:18] <wikibugs>	 (03PS14) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[14:59:29] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10jcrespo)
[15:00:38] <jbond>	 !log disable puppet fleet wide to perform puppetdb database maintenance - T263578
[15:00:39] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) @Papaul let's do that next week, which day/time would work for you?
[15:00:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:42] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[15:12:41] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) @Marostegui will confirm next week with day and time.  Thanks.
[15:14:17] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (10Papaul) 05Open→03Resolved Complete
[15:16:34] <icinga-wm>	 PROBLEM - Host puppetdb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:44] <wikibugs>	 (03PS1) 10Cwhite: logstash: alertmanager: add alertname and summary to labels [puppet] - 10https://gerrit.wikimedia.org/r/717441 (https://phabricator.wikimedia.org/T289356)
[15:17:00] <icinga-wm>	 PROBLEM - Host puppetdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:20] <jbond>	 !log create lvm snapshot puppetdb1002_data_snapshot on ganeti1012 - T263578
[15:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:26] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[15:20:38] <icinga-wm>	 RECOVERY - Host puppetdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[15:21:04] <jbond>	 !log create lvm snapshot puppetdb2002_data_snapshot on ganeti2023 - T263578
[15:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:01] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) created snapshots as a rollback strategy post vacuum  ` name=ganeti1012  $ sudo lvdisplay ganeti/puppetdb1002_data_snapshot...
[15:22:09] <wikibugs>	 (03PS1) 10Cwhite: logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356)
[15:22:26] <icinga-wm>	 RECOVERY - Host puppetdb2002 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms
[15:23:42] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:06] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:19] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) `VACUUM FULL VERBOSE ANALYZE; ` is running on puppetdb1002 in a tmux session under my user
[15:32:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm)
[15:32:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) The commands are hanging to disconnect this server from the cluster, so I have to reboot it in order to break the link. I've downtimed it in...
[15:32:58] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.052 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:36:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) That was successful:  ` root@labstore1004:~# drbd-overview  1:test/0   StandAlone Primary/Unknown UpToDate/DUnknown /srv/test  ext4 9.8G 535M...
[15:36:40] <wikibugs>	 (03CR) 10Herron: rsyslog: stop saving trafficserver-tls logs to disk (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema)
[15:40:38] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:41:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) The urgency is moderate here. The NFS service for the cloud (a core function of #toolforge ) is still up and should function fine. This is th...
[15:42:08] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:42:26] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:43:58] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:44:04] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:44:22] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:44:28] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[15:46:17] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) >>! In T263578#7331105, @jbond wrote: > `VACUUM FULL VERBOSE ANALYZE; ` is running on puppetdb1002 in a tmux session under my user  Well t...
[15:49:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:50:19] <wikibugs>	 (03PS1) 10Herron: thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule [puppet] - 10https://gerrit.wikimedia.org/r/717473 (https://phabricator.wikimedia.org/T289615)
[15:51:53] <wikibugs>	 (03PS1) 10Volans: sre.hosts.decommission: catch unhandled exception [cookbooks] - 10https://gerrit.wikimedia.org/r/717475 (https://phabricator.wikimedia.org/T290326)
[15:53:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:53:28] <jbond>	 !log enable puppet fleet wide after puppetdb database maintenance - T263578
[15:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:32] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[15:56:00] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: alertmanager: add alertname and summary to labels [puppet] - 10https://gerrit.wikimedia.org/r/717441 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite)
[15:57:02] <wikibugs>	 (03PS2) 10Cwhite: logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356)
[15:57:28] <wikibugs>	 (03PS4) 10BryanDavis: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm)
[16:02:44] <wikibugs>	 (03PS1) 10Elukey: istio: change node port for HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/717482 (https://phabricator.wikimedia.org/T289835)
[16:03:03] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm)
[16:06:22] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm)
[16:06:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] istio: change node port for HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/717482 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[16:08:12] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483
[16:10:13] <gehel>	 !log blazegraph (public codfw cluster) will now restart every hour - T290330
[16:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:18] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[16:10:20] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] query service: Fix loading of DCATAP file [puppet] - 10https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse)
[16:14:04] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:55] <wikibugs>	 (03PS1) 10Cwhite: logstash: route aqs and restbase logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565)
[16:18:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:18:17] <wikibugs>	 (03PS1) 10Cwhite: logstash: route gitlab logs to default indexes [puppet] - 10https://gerrit.wikimedia.org/r/717490 (https://phabricator.wikimedia.org/T274462)
[16:19:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:19:02] <wikibugs>	 (03PS2) 10Cwhite: logstash: route aqs and restbase logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565)
[16:19:06] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:19:06] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:19:18] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:20:25] <wikibugs>	 (03PS3) 10Volans: sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885)
[16:20:28] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:23:57] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: route aqs and restbase logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[16:24:11] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: route gitlab logs to default indexes [puppet] - 10https://gerrit.wikimedia.org/r/717490 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite)
[16:26:20] <wikibugs>	 (03CR) 10Volans: "addressed comments, this implies already the changes made in I8357ef4524bc3841cd45126c51479daf60f50cc2" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans)
[16:29:46] <wikibugs>	 (03PS1) 10Jgreen: Re-enable icinga monitoring on payments1008, adding check_ssl_staging [puppet] - 10https://gerrit.wikimedia.org/r/717492 (https://phabricator.wikimedia.org/T289869)
[16:32:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[16:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:41] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330)
[16:35:33] <wikibugs>	 (03CR) 10Jforrester: "I imagine we'd also want to restrict the file and template namespaces, given the impact that edits there will have on the content?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm)
[16:36:36] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Re-enable icinga monitoring on payments1008, adding check_ssl_staging [puppet] - 10https://gerrit.wikimedia.org/r/717492 (https://phabricator.wikimedia.org/T289869) (owner: 10Jgreen)
[16:37:08] <wikibugs>	 (03PS2) 10Urbanecm: foundationwiki: Restrict editing of sensitive namespaces to `editor` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350)
[16:40:29] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330)
[16:40:44] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] foundationwiki: Restrict editing of sensitive namespaces to `editor` group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm)
[16:41:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:42:01] <wikibugs>	 (03PS3) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330)
[16:42:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:42:38] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330)
[16:45:48] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330)
[16:46:29] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:47:02] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:47:36] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:48:51] <wikibugs>	 (03PS1) 10Urbanecm: foundationwiki: Restrict uploading to editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717497 (https://phabricator.wikimedia.org/T205350)
[16:51:07] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483 (owner: 10PipelineBot)
[16:53:41] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483 (owner: 10PipelineBot)
[16:55:30] <wikibugs>	 (03CR) 10Ryan Kemper: "Small issue with randomizedsecdelay syntax, reverting to fix" [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[16:55:47] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "wdqs: temp mitigation => restart hourly w random" [puppet] - 10https://gerrit.wikimedia.org/r/717451
[16:57:17] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Revert "wdqs: temp mitigation => restart hourly w random" [puppet] - 10https://gerrit.wikimedia.org/r/717451 (owner: 10Ryan Kemper)
[16:57:46] <wikibugs>	 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808)
[17:04:57] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) I think we're probably not doing this for now - please reopen if you feel strongly!
[17:05:07] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) 05Open→03Declined
[17:09:19] <wikibugs>	 (03PS1) 10Urbanecm: [WIP] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347)
[17:09:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "do not merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm)
[17:09:56] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] [WIP] Connect foundationwiki to SUL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm)
[17:10:30] <wikibugs>	 (03PS2) 10Urbanecm: [WIP] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347)
[17:10:43] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717508 (https://phabricator.wikimedia.org/T290330)
[17:10:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] [WIP] Connect foundationwiki to SUL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm)
[17:12:22] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717508 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper)
[17:17:17] <wikibugs>	 (03PS1) 10Jgreen: remove deprecated payments.frdev.wikimedia.org A record [dns] - 10https://gerrit.wikimedia.org/r/717510
[17:17:55] <ryankemper>	 !log T290330 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/717508 across `wdqs` fleet; codfw wdqs hosts will restart on average once per hour now to address ongoing availability issues for wdqs codfw
[17:18:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:01] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[17:18:11] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347)
[17:20:22] <wikibugs>	 (03CR) 10Jgreen: [V: 03+2 C: 03+2] remove deprecated payments.frdev.wikimedia.org A record [dns] - 10https://gerrit.wikimedia.org/r/717510 (owner: 10Jgreen)
[17:21:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:23] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:29] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:41] <wikibugs>	 (03PS2) 10Urbanecm: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347)
[17:21:47] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:25] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:37] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:43] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:54] <ryankemper>	 ^ These are cropping up as an implementation detail, sorry for the noise, fixing now
[17:23:05] <ryankemper>	 (Silencing wdqs1* briefly in the meantime)
[17:30:25] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:31:01] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: return 0 exit code for non-codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/717512
[17:32:41] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: return 0 exit code for non-codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/717512 (owner: 10Ryan Kemper)
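Note on the fix above: the alerts on wdqs1* came from the hourly restart unit exiting non-zero on hosts outside codfw, and the follow-up patch makes those hosts exit 0. A minimal sketch of that guard logic follows, in Python purely for illustration. The `plan_restart` helper, the `wdqs2` hostname prefix as a stand-in for "codfw host", and the one-hour delay window are all assumptions, not the contents of the actual puppet change.

```python
import random
from typing import Optional, Tuple


def plan_restart(
    hostname: str,
    max_delay_s: int = 3600,
    rng: Optional[random.Random] = None,
) -> Tuple[int, Optional[int]]:
    """Return (exit_code, delay_seconds) for a hypothetical restart wrapper.

    delay_seconds is None when the host should not restart at all.
    """
    rng = rng or random.Random()
    # Assumption: codfw wdqs hosts are the ones named wdqs2*.
    if not hostname.startswith("wdqs2"):
        # Nothing to do here; exit 0 so systemd does not mark the unit failed.
        return 0, None
    # Spread restarts over the window so hosts do not all restart together.
    return 0, rng.randrange(max_delay_s)


if __name__ == "__main__":
    for host in ("wdqs1009", "wdqs2001"):
        print(host, plan_restart(host))
```

The point of the sketch is only that "not applicable on this host" is reported as success rather than failure, which is what keeps the systemd state check quiet on the non-codfw hosts.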
[17:35:09] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:36] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[17:35:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:01] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:43] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:57] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:57] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:57] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:01] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:19] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:21] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:37] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:40:22] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[17:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:20] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[17:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:27] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "This seems like a serious win for smaller images." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 (owner: 10Majavah)
[18:16:17] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 04-1] "This appears to cause a release that is stuck in install/upgrade/rollback mode if the job doesn't succeed. There's no way to modify the re" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm)
[18:24:54] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10LSobanski)
[18:28:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) 05Resolved→03Open Hi again! One further question for you all; does @JMando have access to jupyter? The command `ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880` seems to  open a...
[18:43:43] <wikibugs>	 (03CR) 10Jbond: "LGTM, not tested but i think its fine to merge then test" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans)
[18:43:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans)
[19:01:10] <wikibugs>	 (03Restored) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[19:01:21] <wikibugs>	 (03PS2) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132)
[19:03:22] <wikibugs>	 (03CR) 10Nikki Nikkhoui: "@Cole would this work for my service in Cloud VPS not in deployment-prep?" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[19:04:43] <ryankemper>	 !log T290330 `ryankemper@cumin1001:~$ sudo -E cumin 'P{wdqs2*}' 'sudo rm -fv /etc/cron.hourly/restart-blazegraph'` (Cleaned up manually created crons now that we have [somewhat hacky] systemd timers doing the same job)
[19:04:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:49] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[19:06:19] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565
[19:10:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10Ottomata) http://localhost:8880
[19:13:55] <wikibugs>	 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber)
[19:17:27] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1039.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:17:36] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565 (owner: 10BryanDavis)
[19:20:40] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565 (owner: 10BryanDavis)
[19:23:06] <wikibugs>	 (03CR) 10Cwhite: Add image suggestion api to lookup table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[19:26:45] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[19:26:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:12] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [integration/docroot@6492b3d]: I48480e89e5f6
[19:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:22] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [integration/docroot@6492b3d]: I48480e89e5f6 (duration: 00m 10s)
[19:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:16] <wikibugs>	 (03PS3) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132)
[19:38:51] <wikibugs>	 (03CR) 10Nikki Nikkhoui: "ok" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[19:45:27] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Whether here in the lookup table or writing a rule in 20-trafficserver.conf, I don't feel strongly." [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema)
[19:50:04] <wikibugs>	 (03PS1) 10Herron: set default slo field values and remove duplicates [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717584
[19:50:27] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[19:52:06] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] facter networking: filter out cali/tap interfaces [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond)
[19:52:46] <wikibugs>	 (03Abandoned) 10Herron: logstash: route alertmanager alerts to logstash alerts index [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron)
[20:03:39] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586
[20:06:49] <wikibugs>	 (03PS1) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036)
[20:11:05] <wikibugs>	 (03CR) 10Nikki Nikkhoui: "Is there someone else that i could add to the patch for a +2? Or do you allow self-merging patches?" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[20:11:37] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586 (owner: 10BryanDavis)
[20:11:49] <wikibugs>	 (03CR) 10Herron: "here is the varnish dashboard preview for this patch https://grafana.wikimedia.org/dashboard/snapshot/rAagzkSklHmPEZ4qBlHsiaIU0FBTHgdH?org" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron)
[20:13:14] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui)
[20:17:34] <wikibugs>	 (03PS1) 10Mholloway: Convert $wgEventStreams to be an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193)
[20:18:30] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586 (owner: 10BryanDavis)
[20:19:01] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Nit inline, but otherwise LGTM" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron)
[20:27:13] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:30:07] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[20:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:07] <wikibugs>	 (03CR) 10Ottomata: Convert $wgEventStreams to be an associative array (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway)
[20:51:24] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595
[20:56:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) I am able to access with http://localhost:8880. Thank you!
[20:56:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) 05Open→03Resolved Ah, nice. I'll go get my vision checked and close out this task. Thank you for correcting my numerous typos during this setup process.
[20:56:46] <wikibugs>	 (03PS1) 10Legoktm: nodejs-devel: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717598 (https://phabricator.wikimedia.org/T290209)
[20:57:47] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] nodejs-devel: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717598 (https://phabricator.wikimedia.org/T290209) (owner: 10Legoktm)
[21:03:59] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:05:52] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "I'm going to merge this one with the understanding that if anyone moves to release, we need to nag me to run my script. It looks like ther" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm)
[21:06:55] <wikibugs>	 (03Merged) 10jenkins-bot: Use common k8s labels [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm)
[21:20:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:22:53] <wikibugs>	 (03PS4) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001)
[21:22:55] <wikibugs>	 (03CR) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson)
[21:29:02] <wikibugs>	 (03PS1) 10Ahmon Dancy: check_binary: Improve error message [deployment-charts] - 10https://gerrit.wikimedia.org/r/717605
[21:31:51] <wikibugs>	 (03PS1) 10Ebernhardson: Add cname for commons-query.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/717606 (https://phabricator.wikimedia.org/T282117)
[21:34:47] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595 (owner: 10BryanDavis)
[21:37:58] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595 (owner: 10BryanDavis)
[21:39:11] <wikibugs>	 (03PS1) 10Ahmon Dancy: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621
[21:40:47] <wikibugs>	 (03CR) 10Ahmon Dancy: "This is an alternate approach to solving the problem described in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/717066" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy)
[21:44:53] <wikibugs>	 (03PS5) 10Ebernhardson: query_service: support multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247)
[21:44:55] <wikibugs>	 (03PS1) 10Ebernhardson: Deploy query_service microsite for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/717630
[21:49:46] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[21:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:56] <wikibugs>	 (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717644
[22:02:34] <wikibugs>	 (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717648
[22:05:57] <wikibugs>	 (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717650
[22:07:03] <wikibugs>	 (03PS1) 10Krinkle: clinic-duty: Misc JS clean ups [software] - 10https://gerrit.wikimedia.org/r/717651
[22:10:51] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+1] "This is working without causing helm to be stuck doing a release/upgrade for me. Unfortunately when I run rake I get an error, but it's no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy)
[22:12:36] <wikibugs>	 10SRE-Access-Requests, 10Release-Engineering-Team: Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy)
[22:12:56] <wikibugs>	 10SRE-Access-Requests, 10Release-Engineering-Team: Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy)
[22:12:59] <wikibugs>	 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy)
[22:13:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) 05Stalled→03Open
[22:14:36] <wikibugs>	 (03CR) 10Ahmon Dancy: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy)
[22:15:41] <wikibugs>	 (03PS2) 10Krinkle: clinic-duty: Misc JS clean ups [software] - 10https://gerrit.wikimedia.org/r/717651
[22:21:03] <wikibugs>	 (03PS1) 10Krinkle: clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653
[22:29:59] <wikibugs>	 10ops-codfw, 10DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T290362 (10wiki_willy)
[22:30:12] <wikibugs>	 10ops-codfw, 10DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T290362 (10wiki_willy)
[22:39:14] <wikibugs>	 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10wiki_willy)
[22:39:38] <wikibugs>	 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10wiki_willy)
[22:41:09] <wikibugs>	 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (10dduvall) 05Open→03Resolved a:03dduvall Resolving since the docs are now up-to-date.
[22:45:01] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10wiki_willy) 05Open→03Resolved a:03wiki_willy Resolving this task.  After talking to Chris, we'll update the eqiad inventory after the next recycling pickup in a...
[22:49:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10Sj) I was just dreaming of having such a backup off-cluster.  Let me know when there is a half-PB of files that I can host. (Ma...
[23:02:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (10wiki_willy)
[23:03:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10wiki_willy)
[23:09:12] <wikibugs>	 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10wiki_willy) After talking to John, the ETA to start on this is in a couple weeks - mid September
[23:34:17] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:48] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10Tgr) GerritRobotComments seems to...
[23:37:11] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:38:05] <wikibugs>	 (03PS1) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[23:38:21] <wikibugs>	 (03PS2) 10Krinkle: ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[23:40:08] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Thanks, all LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716553 (owner: 10BryanDavis)
[23:42:07] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "Why not ensure => absent /var/lib/mailman/templates and /etc/mailman in mailman::webui too?" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[23:44:27] <wikibugs>	 (03PS2) 10Ladsgroup: mailman: Drop listinfo files [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303)
[23:44:46] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup)
[23:45:01] <wikibugs>	 (03CR) 10Ladsgroup: mailman: Drop listinfo files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[23:45:08] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[23:46:08] <wikibugs>	 (03PS3) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[23:50:25] <wikibugs>	 (03CR) 10Krinkle: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Package[docker-ce]' in p" [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle)
[23:50:43] <Krinkle>	 Amir1: would you happen to know what's up with puppet ^
[23:51:00] <Amir1>	 let me take a look
[23:51:42] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Remind me to merge this on Tuesday in case I forget" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[23:52:03] <Krinkle>	 I've cherry-picked it onto integration-puppetmaster-02
[23:52:18] <Krinkle>	 and then run puppet agent -tv on the new qemu-1002 instance
[23:54:20] <Amir1>	 Krinkle: the line that errors is 54
[23:54:26] <Amir1>	 which is hard-coding docker-ce
[23:55:07] <Amir1>	 I assume this should be skipped somehow?
[23:55:14] <Krinkle>	 oh down there
[23:55:18] <Amir1>	 or I'm misunderstanding you
[23:55:27] <Krinkle>	 Yes, no, you're absolutely right
[23:55:29] <Krinkle>	 this should be simple
[23:56:22] <wikibugs>	 (03PS4) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[23:56:30] <Amir1>	 happy to be a rubber duck :D
[23:56:57] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] ci: Add 'bulleye' to docker lsbdistcodename hack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle)
[23:57:33] <Krinkle>	 not yet :)
[23:58:12] <wikibugs>	 (03PS5) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[23:58:52] <Amir1>	 now? :D
[23:59:09] <Krinkle>	 Amir1: maybe in a few days
[23:59:24] <Amir1>	 cool
[23:59:25] <Krinkle>	 This is a lot of new stuff I'm hacking together first.
[23:59:31] <Krinkle>	 might change my mind etc.