[00:00:03] it's only run manually so not risky.
[00:00:31] (PS1) Gergő Tisza: fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531)
[00:00:50] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:53] (CR) Gergő Tisza: [C: +2] fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531) (owner: Gergő Tisza)
[00:02:13] (CR) Legoktm: "Untested" [deployment-charts] - https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: Legoktm)
[00:05:48] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:36] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:34] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:29] fixed deploy_to_mwdebug
[00:13:41] well, reset it
[00:14:18] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:04] (CR) Gergő Tisza: [C: +1] Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) (owner: Urbanecm)
[00:18:31] (Merged) jenkins-bot:
fixLinkRecommendationData: Try harder to avoid >10K result sets [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716491 (https://phabricator.wikimedia.org/T284531) (owner: Gergő Tisza)
[00:20:17] (PS2) Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - https://gerrit.wikimedia.org/r/716635
[00:21:53] (PS1) Cwhite: logstash: temporariliy reroute alertmanager webhook logs [puppet] - https://gerrit.wikimedia.org/r/716637
[00:23:34] (PS2) Cwhite: logstash: temporariliy reroute alertmanager webhook logs [puppet] - https://gerrit.wikimedia.org/r/716637
[00:23:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:17] (PS3) Cwhite: logstash: temporarily reroute alertmanager webhook logs [puppet] - https://gerrit.wikimedia.org/r/716637
[00:25:15] (PS3) Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - https://gerrit.wikimedia.org/r/716635
[00:25:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:09] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Backport: [[gerrit:716491|fixLinkRecommendationData: Try harder to avoid >10K result sets (T284531)]] (duration: 00m 58s)
[00:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:15] T284531: Add Link: Work around 10K search result set limit in fixLinkRecommendationData.php - https://phabricator.wikimedia.org/T284531
[00:50:56] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:30] (CR) Cwhite: [C: +2] logstash: temporarily reroute alertmanager webhook logs [puppet] - https://gerrit.wikimedia.org/r/716637 (owner: Cwhite)
[01:00:57] (CR) Cwhite: [C: +2] logstash: nest_root_fields exclude tags and severity [puppet] - https://gerrit.wikimedia.org/r/716635 (owner: Cwhite)
[01:01:07] (PS4) Cwhite: logstash: nest_root_fields exclude tags and severity [puppet] - https://gerrit.wikimedia.org/r/716635
[01:44:06] (Abandoned) Juan90264: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: Juan90264)
[02:20:25] (PS2) Krinkle: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[02:20:28] (CR) Krinkle: [C: +1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[02:20:59] (PS3) Krinkle: profiler: use seperate pipeline inside k8s pods [mediawiki-config] -
https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[02:21:15] (CR) Krinkle: [C: +1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[02:27:30] (PS1) Gergő Tisza: Run GrowthExperiments fixLinkRecommendationData --dry-run every day [puppet] - https://gerrit.wikimedia.org/r/716755 (https://phabricator.wikimedia.org/T283868)
[02:32:58] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:05:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:06:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:12] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:26:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:24] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 14.08 ge 10
https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:01:58] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 11.09 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2138 for upgrade', diff saved to https://phabricator.wikimedia.org/P17192 and previous config saved to /var/cache/conftool/dbconfig/20210903-050423-marostegui.json
[05:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17193 and previous config saved to /var/cache/conftool/dbconfig/20210903-051124-root.json
[05:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:29] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17194 and previous config saved to /var/cache/conftool/dbconfig/20210903-051149-root.json
[05:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:10] (PS1) Marostegui: mariadb: Remove pc2007 from puppet [puppet] - https://gerrit.wikimedia.org/r/716915 (https://phabricator.wikimedia.org/T289112)
[05:20:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2007.codfw.wmnet
[05:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:09] (CR) Marostegui: [C: +2] mariadb: Remove pc2007 from puppet [puppet] - https://gerrit.wikimedia.org/r/716915
(https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[05:23:12] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.51 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:26:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17195 and previous config saved to /var/cache/conftool/dbconfig/20210903-052628-root.json
[05:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:33] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17196 and previous config saved to /var/cache/conftool/dbconfig/20210903-052653-root.json
[05:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:30] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts pc2007.codfw.wmnet
[05:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:39] (PS1) Marostegui: wmnet: Update pcX-master [dns] - https://gerrit.wikimedia.org/r/716936 (https://phabricator.wikimedia.org/T284825)
[05:36:42] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 13.94 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:41:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17198 and previous config saved to /var/cache/conftool/dbconfig/20210903-054131-root.json
[05:41:36] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:37] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:41:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17199 and previous config saved to /var/cache/conftool/dbconfig/20210903-054157-root.json
[05:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17200 and previous config saved to /var/cache/conftool/dbconfig/20210903-055635-root.json
[05:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:40] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[05:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17201 and previous config saved to /var/cache/conftool/dbconfig/20210903-055700-root.json
[05:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:09] (PS2) Marostegui: mariadb: Promote db2110 to s4 master [puppet] - https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650)
[06:05:13] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.86 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17202 and previous config saved to /var/cache/conftool/dbconfig/20210903-061138-root.json
[06:11:43] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:44] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[06:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17203 and previous config saved to /var/cache/conftool/dbconfig/20210903-061204-root.json
[06:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:43] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:35] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:21:21] (CR) Elukey: Introduce the secrets helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (owner: Elukey)
[06:27:14] (CR) Elukey: "Hello Folks! Thanks a lot for the follow up, I completely understand why you are forking everything but it seems not super DRY.
I didn't m" [puppet] - https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: Michael DiPietro)
[06:33:29] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:51] (PS9) Elukey: Introduce the secrets helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835)
[06:38:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:20] RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 2.426 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:40:15] (CR) Elukey: "@JMeybohm: should be ready for another pass, lemme know if I forgot anything!"
[deployment-charts] - https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[06:43:50] ACKNOWLEDGEMENT - Host mw2264 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T290242
[06:45:25] !log run `apt-get clean` on cp5012 to free some space (94% of the root partition used)
[06:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:00] (CR) Marostegui: [C: +2] wmnet: Update pcX-master [dns] - https://gerrit.wikimedia.org/r/716936 (https://phabricator.wikimedia.org/T284825) (owner: Marostegui)
[06:57:43] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210903T0700)
[07:00:08] (PS1) Urbanecm: Growth: Remove config that moved on-wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295)
[07:01:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:58] ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui)
[07:05:02] ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) This is ready for #dc-ops
[07:05:23] ops-eqiad, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) a:Marostegui→wiki_willy
[07:05:58] ops-codfw, DC-Ops, decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (Marostegui) a:wiki_willy→Papaul
[07:10:57] !log more weight to ms-be20[62-65] - T288458
[07:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:01] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458
[07:16:09] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (fgiunchedi) p:Triage→Medium
[07:17:14] I have increased the eqiad backup speed to 14 threads, as fridays tend to be of lower load
[07:17:47] there was some weird load graph from 0h to 7h, but that was me, it was mainly writes
[07:19:31] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:20:52] (PS2) Marostegui: wmnet: Switchover db2090
with db2110 [dns] - https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650)
[07:21:06] (CR) Muehlenhoff: [C: +2] Update repository hook for Gitlab 14 [puppet] - https://gerrit.wikimedia.org/r/716346 (https://phabricator.wikimedia.org/T289802) (owner: Muehlenhoff)
[07:21:17] SRE, wikitech.wikimedia.org, Sustainability (Incident Followup): Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (LSobanski) p:Triage→Medium a:LSobanski
[07:23:08] (PS1) JMeybohm: mediawiki-dev: Run setup-db as helm hook [deployment-charts] - https://gerrit.wikimedia.org/r/717066
[07:23:18] (PS1) Zabe: Typo fix: 'the the' -> 'the' [deployment-charts] - https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491)
[07:26:52] (Abandoned) Gehel: Make retries less verbose. [software/spicerack] - https://gerrit.wikimedia.org/r/491531 (owner: Gehel)
[07:26:55] (Abandoned) Gehel: [WIP] log relocating shards during cluster restart [software/spicerack] - https://gerrit.wikimedia.org/r/492307 (owner: Gehel)
[07:30:12] (PS1) Elukey: knative-serving: add serving chart to helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:33:51] (PS2) Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:38:49] (CR) JMeybohm: [C: +1] Introduce the secrets helm chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[07:42:54] !log Remove flaggedrevs_stats2 and flaggedrevs_stats from several s3 wikis - T289050
[07:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:58] T289050: MyISAM flaggedrevs_stats tables on several sections -
https://phabricator.wikimedia.org/T289050
[07:43:15] !log uploaded pygments 2.10.0+dfsg-1~wmf1 to apt.wm.o in component/pygments
[07:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:22] (PS3) Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[07:49:51] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) Increased threads to 14 -7 on each worker. We now get a backup speed of over 44 files/s, which would imply a pending c...
[07:53:49] (PS1) Filippo Giunchedi: clinic-duty: add equinix maint support [software] - https://gerrit.wikimedia.org/r/717100
[07:55:34] up for volunteers for review ^
[07:55:51] checking
[07:57:49] cheers elukey
[07:59:37] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-fgiunchedi: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (fgiunchedi)
[08:01:17] (CR) Filippo Giunchedi: POC sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[08:04:08] (PS2) Muehlenhoff: Add emacs-nox to standard packages [puppet] - https://gerrit.wikimedia.org/r/377721
[08:05:53] (PS1) Alexandros Kosiaris: mathoid: Bump deployed version [deployment-charts] - https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870)
[08:12:40] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/377721 (owner: Muehlenhoff)
[08:12:50] (CR) Alexandros Kosiaris: [C: +2] Revert "mathoid: Pin the chart version" [deployment-charts] - https://gerrit.wikimedia.org/r/716361 (owner: Alexandros
Kosiaris)
[08:13:04] (CR) Alexandros Kosiaris: [C: +2] mathoid: Bump deployed version [deployment-charts] - https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870) (owner: Alexandros Kosiaris)
[08:15:02] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631273 (owner: PipelineBot)
[08:15:09] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/640516 (owner: PipelineBot)
[08:15:15] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/640523 (owner: PipelineBot)
[08:15:20] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/651764 (owner: PipelineBot)
[08:15:25] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/653264 (owner: PipelineBot)
[08:15:32] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/653525 (owner: PipelineBot)
[08:15:36] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/654282 (owner: PipelineBot)
[08:15:41] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/654312 (owner: PipelineBot)
[08:15:45] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/654316 (owner: PipelineBot)
[08:15:48] (Merged) jenkins-bot: Revert "mathoid: Pin the chart version" [deployment-charts] - https://gerrit.wikimedia.org/r/716361 (owner: Alexandros Kosiaris)
[08:15:50] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] -
https://gerrit.wikimedia.org/r/677846 (owner: PipelineBot)
[08:15:54] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/677942 (owner: PipelineBot)
[08:15:59] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/704136 (owner: PipelineBot)
[08:16:05] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/704141 (owner: PipelineBot)
[08:16:10] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/704140 (owner: PipelineBot)
[08:16:16] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/704145 (owner: PipelineBot)
[08:16:20] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/704144 (owner: PipelineBot)
[08:16:22] (Merged) jenkins-bot: mathoid: Bump deployed version [deployment-charts] - https://gerrit.wikimedia.org/r/717115 (https://phabricator.wikimedia.org/T205870) (owner: Alexandros Kosiaris)
[08:17:27] (Abandoned) Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/715933 (owner: PipelineBot)
[08:19:19] (PS1) Alexandros Kosiaris: mathoid: Fix typo in version deployed [deployment-charts] - https://gerrit.wikimedia.org/r/717125
[08:19:28] (CR) Alexandros Kosiaris: [V: +2 C: +2] mathoid: Fix typo in version deployed [deployment-charts] - https://gerrit.wikimedia.org/r/717125 (owner: Alexandros Kosiaris)
[08:23:22] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[08:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:03] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 87 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:35:40] (CR) Kormat: [C: +1] prometheus: couple mysqld export service to mariadb (multi-instance) [puppet] - https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) (owner: MVernon)
[08:35:49] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:37:44] (PS4) Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305)
[08:41:03] (CR) Jelto: [C: +2] helmfile.d admin add dedicated deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:43:41] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 68 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:43:41] (Merged) jenkins-bot: helmfile.d admin add dedicated deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:45:19] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1022.eqiad.wmnet
[08:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:45] !log
cp-eqsin: clean apt cache to free up some space T290305
[08:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:49] T290305: Low root disk space on multiple eqsin cp nodes - https://phabricator.wikimedia.org/T290305
[08:47:26] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) Thanks a lot for the detailed update @jbond!
[08:49:35] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:52:00] (CR) Elukey: [C: +2] Introduce the secrets helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/716235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[08:52:09] (PS4) Elukey: knative-serving: add secrets chart to helmfile config [deployment-charts] - https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835)
[08:52:20] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:55] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:48] (CR) Muehlenhoff: [C: +1] "Looks good, two nits inline" [debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: Filippo Giunchedi)
[09:00:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/715950 (owner: Jbond)
[09:00:52] (CR) Muehlenhoff: Debian: Add support for bookworm as a valid codename (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713615 (owner: MVernon)
[09:01:56] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 72 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:02:10] (PS1) Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249)
[09:03:00] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:47] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:53] (CR) jerkins-bot: [V: -1] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) (owner: Vgutierrez)
[09:06:01] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[09:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 48 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:07:51] pylint going anal again, sigh
[09:08:36] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:22] !log joal@deploy1002 Started deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979]
[09:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:40] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:51] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[09:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:28] (Abandoned) MVernon: Debian: Add support for bookworm as a valid codename [puppet] - https://gerrit.wikimedia.org/r/713615 (owner: MVernon)
[09:13:24] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1022.eqiad.wmnet [09:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:31] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1022.eqiad.wmnet` - mc1022.eqiad.wmnet (**... [09:15:04] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:11] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10akosiaris) [09:18:08] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:19:02] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:19:18] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:20:06] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:22:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 84 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:22:44] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [09:25:02] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1025-1026].eqiad.wmnet [09:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:11] jelto: o/ are the above latency increases related to the helm deployment? [09:26:31] elukey: I'm looking already but I could not find anything yet. I deployed some RBAC changes (cluster roles and rolebindings). Still looking [09:26:58] !log joal@deploy1002 Finished deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979] (duration: 17m 36s) [09:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 57 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:28:32] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:28:50] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:29:30] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:29:38] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:29:47] jelto: ack lemme know if you need help! IIRC we had a similar thing yesterday or a couple of days ago, all self recovered [09:30:21] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch) [09:31:37] (03PS3) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 [09:31:40] (03PS4) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [09:31:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond) [09:32:10] (03CR) 10Jgiannelos: Configure event stream for map tile state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [09:32:35] !log joal@deploy1002 Started deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979] [09:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:43] !log joal@deploy1002 Finished deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979] (duration: 00m 07s) [09:32:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] (03CR) 10Jgiannelos: "Except of the enable canary events I also adapted the patch based on https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/716219" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [09:33:29] elukey: thanks! My guess would be that helmfile apply raised the latency in eqiad clusters because it has done quite a lot of changes, but it's also quite delayed relative to the actual deploy. Now all latencies are back to normal [09:34:35] yes could be what happened, among the metrics I saw a rise of "events" when latency went up [09:34:52] anyway, if it is a temp spike for RBAC it is ok [09:34:56] let's keep it monitored [09:37:44] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 78 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:40:58] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs20 [09:40:58] .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:41:06] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: 
Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are mar [09:41:06] but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:41:18] dcausse, gehel ---^ [09:41:19] ? [09:41:25] hello :) [09:41:35] :) [09:41:58] it seems that some wdqs servers are overloaded (not responding to health checks) [09:42:08] Oops [09:43:31] internal cluster also has higher than usual load [09:44:03] (03CR) 10Elukey: [C: 03+2] "Going to merge and test it, lemme know if anything looks off!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [09:44:07] looks like an increase in imported triples as well [09:44:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:47:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:34] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 38 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:55:01] (03PS2) 10Filippo Giunchedi: clinic-duty: add equinix maint support [software] - 10https://gerrit.wikimedia.org/r/717100 [09:56:12] (03PS2) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) [09:56:14] (03PS1) 10Vgutierrez: embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227 [09:57:44] !log hnowlan@deploy1002 Started deploy 
[analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [09:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:53] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 09s) [09:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:22] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [09:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:16] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) So far the [[ https://grafana.wikimedia.org/d/000000477/puppetdb?viewPanel=7&orgId=1&from=now-7d&to=now | graph ]] is looking much health... [10:00:04] (03CR) 10jerkins-bot: [V: 04-1] embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227 (owner: 10Vgutierrez) [10:00:07] (03CR) 10jerkins-bot: [V: 04-1] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [10:00:15] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 53s) [10:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:18] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 25s) [10:02:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:34] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:10] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s) [10:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:02] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:48] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 45s) [10:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:57] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:10:04] (03PS2) 10Vgutierrez: embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227 [10:10:06] (03PS3) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) [10:12:50] (03CR) 10Kosta Harlan: "Looks correct. Is there any reason to gradually remove this over the course of a week? I suppose it should be OK to remove en masse but we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [10:13:59] effie: the alert above seems related to your changes, is it possible it is waiting for user input? 
[10:16:37] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:16:41] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s) [10:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:22] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) >>! In T289606#7329757, @EYener wrote: > Hi all! Thank you for working on this and granting access for @JMando! We have been working from the level of access authorized, which you'v... [10:21:16] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:11] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 55s) [10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:23:20] 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) p:05Triage→03Medium [10:24:02] (03PS12) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 
10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [10:24:04] (03PS17) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [10:24:06] (03PS19) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [10:24:08] (03PS7) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [10:24:10] (03PS1) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:24:38] (03PS13) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [10:24:46] (03PS18) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [10:24:57] (03PS20) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [10:25:06] (03PS8) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [10:25:13] (03PS2) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:25:49] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:25:56] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 
10Jbond) [10:26:22] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:28:11] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:29:28] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts [10:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:31] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 03s) [10:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:57] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:30:15] 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10fgiunchedi) FTR I emailed the user and contact via email to confirm [10:31:25] (03PS3) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:31:45] (03PS4) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:33:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30997/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:34:15] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 
10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:38:12] (03PS1) 10Alexandros Kosiaris: Handle non dict YAML documents as well [software/service-checker] - 10https://gerrit.wikimedia.org/r/717249 [10:38:26] 10SRE, 10Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (10fgiunchedi) >>! In T290223#7329001, @Legoktm wrote: > Thanks for taking a look :) I don't really understand why spamassasin added the X-Spam-Report header in th... [10:39:43] (03PS1) 10Volans: ipmi: refactor class signature [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250 [10:39:45] (03PS1) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [10:40:13] (03PS9) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [10:41:28] (03PS5) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:41:39] (03PS6) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:42:32] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:42:45] (03PS1) 10Muehlenhoff: labs_bootstrapvz: Install emacs-nox instead of emacs [puppet] - 10https://gerrit.wikimedia.org/r/717252 [10:44:21] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:45:25] !log joal@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-test): Deploy latest code on AQS new servers - 
test after failures [10:45:25] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [10:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:30] !log joal@deploy1002 deploy aborted: Deploy latest code on AQS new servers - test after failures (duration: 00m 05s) [10:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:28] !log joal@deploy1002 Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:01] !log joal@deploy1002 Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures (duration: 00m 32s) [10:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:09] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:48:14] (03PS2) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [10:49:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:51:47] (03PS7) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:53:03] (03PS8) 10Jbond: 
base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:53:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31002/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:54:13] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.038 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:54:57] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc[1025-1026].eqiad.wmnet [10:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:03] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1025-1026].eqiad.wmnet` - mc1025.eqiad.wm... 
[10:55:14] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:55:16] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [10:56:18] (03CR) 10Jbond: [V: 03+1] "PCC still running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31003" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:56:41] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:58:03] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1028-1032].eqiad.wmnet [10:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:29] (03PS3) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [11:00:46] 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10dcaro) [11:03:45] (03CR) 10Jbond: [C: 03+1] "LGTM and a nice improvement thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717250 (owner: 10Volans) [11:04:18] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.041 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:05:37] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [11:10:09] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 
298 bytes in 1.128 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:13:59] (03CR) 10Volans: "Great, black seems to not be pep8-compliant here..." [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [11:14:30] (03CR) 10Jbond: "another nit and a typo" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [11:15:55] 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10dcaro) [11:17:23] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:19:43] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:24:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [11:27:01] (03CR) 10Jbond: "thanks for the updates will look to roll this (and the next one) out Monday" [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [11:27:43] (03PS5) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) [11:28:25] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Dzahn) 05Open→03Resolved a:03Dzahn This should be resolved now. 
[11:29:04] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Set template namespace for code mirror line numbering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch) [11:29:44] (03PS6) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) [11:30:47] (03CR) 10Dzahn: [C: 03+1] Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [11:30:49] (03CR) 10Jbond: "updated thanks all will look to merge on Monday. once merged the facts should disappear from puppetdb after the next puppet run." [puppet] - 10https://gerrit.wikimedia.org/r/715943 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [11:35:25] !log dcausse@deploy1002 Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA [11:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 01m 07s) [11:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:06] !log dcausse@deploy1002 Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA [11:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:52] !log Remove flaggedrevs_stats2 and flaggedrevs_stats from enwiki - T289050 [11:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:57] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050 [11:43:24] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:43:28] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to 
address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:43:46] dcausse: ^ related? [11:43:51] majavah: yes [11:44:09] !log joal@deploy1002 Started deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d] [11:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:23] wdqs in codfw is currently mostly down [11:45:02] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:45:06] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:45:54] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:46:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:47:38] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:48:34] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:48:56] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:49:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:49:36] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:50:50] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:52:14] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm) 05Open→03Resolved [11:52:17] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [11:52:34] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) 05Open→03Resolved [11:54:30] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:54:53] (03PS10) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [11:56:20] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:56:27] !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 19m 21s) [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:51] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [11:57:14] PROBLEM - Uncommitted DNS 
changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:59:20] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:59:56] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:00:06] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:00:56] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:02:21] (03CR) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:02:38] (03PS1) 10Ema: rsyslog: stop saving trafficserver-tls logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) [12:02:41] (03PS1) 10Muehlenhoff: Designate new mc canary [puppet] - 10https://gerrit.wikimedia.org/r/717312 [12:03:08] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:03:25] !log joal@deploy1002 Finished deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d] (duration: 19m 16s) [12:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:50] !log joal@deploy1002 Started deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d] [12:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:56] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:03:57] !log joal@deploy1002 Finished deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d] (duration: 00m 06s) [12:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:44] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:05:38] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:06:19] (03CR) 10Jbond: O:base::resolving: make nameservers mandatory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:08:10] (03CR) 10Michael DiPietro: [C: 03+2] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [12:08:44] (03CR) 10Jbond: [C: 04-1] O:base::resolving: make nameservers mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:12:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1028-1032].eqiad.wmnet [12:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:00] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1028-1032].eqiad.wmnet` - mc1028.eqiad.wm... 
[12:13:56] (03CR) 10Jbond: [C: 04-1] O:base::resolving: make nameservers mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:23:14] (03CR) 10Dzahn: [C: 03+1] query_service: remove absented query-service-gc-log-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:26:50] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:30:18] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) >>! In T251305#7319375, @elukey wrote: > Adding a comment in here since I am trying to figure out a similar thing (although I have way less context) for what we'll probably call `... [12:32:23] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc[1035-1036].eqiad.wmnet [12:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:15] (03PS1) 10Dzahn: add HTML of the first 10000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538) [12:33:17] (03CR) 10Majavah: [C: 03+2] d/changelog: prepare 0.23 release [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/716221 (owner: 10Majavah) [12:34:33] (03Merged) 10jenkins-bot: d/changelog: prepare 0.23 release [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/716221 (owner: 10Majavah) [12:35:40] (03CR) 10Effie Mouzeli: "I had this in Iba888921391ec33e0bd7caadf435ae453c34ae5f, I wanted to decom all hosts and update it, but this will do" [puppet] - 10https://gerrit.wikimedia.org/r/717312 (owner: 10Muehlenhoff) [12:35:43] (03PS2) 10Dzahn: add HTML of the first 10000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/717347 
(https://phabricator.wikimedia.org/T281538) [12:35:59] (03PS2) 10Effie Mouzeli: Designate new mc canary [puppet] - 10https://gerrit.wikimedia.org/r/717312 (owner: 10Muehlenhoff) [12:41:28] (03CR) 10Dzahn: [C: 03+2] add HTML of the first 10000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:43:29] (03Merged) 10jenkins-bot: add HTML of the first 10000 bugs [container/miscweb] - 10https://gerrit.wikimedia.org/r/717347 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:45:22] (03CR) 10Effie Mouzeli: [C: 03+2] Designate new mc canary [puppet] - 10https://gerrit.wikimedia.org/r/717312 (owner: 10Muehlenhoff) [12:46:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1035-1036].eqiad.wmnet [12:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:29] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc[1035-1036].eqiad.wmnet` - mc1035.eqiad.wm... 
[12:47:15] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [12:48:32] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1023.eqiad.wmnet [12:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1023.eqiad.wmnet [13:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:41] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1023.eqiad.wmnet` - mc1023.eqiad.wmnet (**... [13:04:07] (03CR) 10Ema: "I've discussed this on IRC with Filippo and he pointed out the alternative approach of using the "stop" statement in 20-trafficserver.conf" [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [13:04:28] (03CR) 10Jgiannelos: Configure event stream for map tile state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:04:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs20 [13:04:40] .wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:04:53] 
!log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [13:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:43] (03PS14) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [13:07:45] (03PS4) 10Volans: ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 [13:07:47] (03PS1) 10Volans: setup.py: revert upper limit for regex [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 [13:09:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:10:59] !log installing openjdk-8-dbg on wdqs2007 [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:02] PROBLEM - DNS on mc1026.mgmt is CRITICAL: Domain mc1026.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:11:43] ^ me [13:11:46] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet [13:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:52] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1027.eqiad.wmnet` - mc1027.eqiad.wmnet (**... 
[13:13:03] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add status and reboot capabilities [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [13:14:12] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs20 [13:14:12] .wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:15:34] (03CR) 10Volans: "This passes locally on 3.9, I'll check with all the combinations later" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [13:16:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:16:59] (03PS3) 10Effie Mouzeli: Decommission old eqiad memcached hosts [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) [13:17:56] (03CR) 10Filippo Giunchedi: "Extra comments LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [13:18:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [13:19:06] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:20:06] (03CR) 10Elukey: "This was of course wrong, I have to study a bit more helmfile :)" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/717069 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [13:20:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [13:20:16] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) Thanks! Error for me when I `ssh -v stat1007.eqiad.wmnet`: `eyener@wmf2395 ~ % ssh stat1007 -v eqiad.wmnet OpenSSH_8.1p1, LibreSSL 2.7.3 debug1: Reading configuration data /Users/eyene... [13:20:56] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) a:05jijiki→03Cmjohnson [13:24:19] (03PS19) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [13:30:08] (03PS1) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [13:31:01] (03PS2) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [13:32:11] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10Ottomata) @EYener you typed `ssh stat1007 -v eqiad.wmnet`, try `ssh -v stat1007.eqiad.wmnet` :) [13:37:12] (03PS3) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [13:37:28] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) Oh yes. That will help. :) Connected, thank you! 
[13:37:58] (03CR) 10Ottomata: Configure event stream for map tile state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:40:40] (03CR) 10Ottomata: Configure event stream for map tile state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:45:09] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: Install emacs-nox instead of emacs [puppet] - 10https://gerrit.wikimedia.org/r/717252 (owner: 10Muehlenhoff) [13:49:49] (03PS9) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [13:50:54] (03CR) 10Jbond: base::resolving: convert base::resolving to a profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:51:45] (03CR) 10Ottomata: "BTW, you gave me a reason to write:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:52:10] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:53:52] (03PS1) 10Majavah: Swap emacs with emacs-nox [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 [13:54:47] (03CR) 10Andrew Bogott: [C: 03+1] Swap emacs with emacs-nox [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 (owner: 10Majavah) [13:59:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:15] (03CR) 10Jbond: O:base::resolver: 
unify resolv.conf templates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:05:11] (03PS4) 10Effie Mouzeli: Decommission old eqiad memcached hosts [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) [14:11:50] PROBLEM - Host mw2264.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:12:06] (03PS1) 10Elukey: knative-serving: fix the istio_secrets template [deployment-charts] - 10https://gerrit.wikimedia.org/r/717435 [14:12:11] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) Fantastic, thank you @EYener and @Ottomata ! Awaiting confirmation of access from @JMando [14:12:22] RECOVERY - Host mw2264.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [14:12:26] (03PS15) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:12:28] (03PS20) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:12:30] (03PS21) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:12:32] (03PS11) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [14:12:34] (03PS10) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [14:13:36] RECOVERY - Host mw2264 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [14:13:46] PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:18] (03CR) 
10Andrew Bogott: O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:15:02] PROBLEM - Ensure local MW versions match expected deployment on mw2264 is CRITICAL: CRITICAL: 320 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:16:14] (03CR) 10Elukey: [C: 03+2] knative-serving: fix the istio_secrets template [deployment-charts] - 10https://gerrit.wikimedia.org/r/717435 (owner: 10Elukey) [14:16:33] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:17:08] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:17:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:18:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:48] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:20:13] !log mw2264 - scap pull [14:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:38] mutante: thanks [14:20:52] RECOVERY - Ensure local MW versions match expected deployment on mw2264 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:21:04] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 7.285 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:23:01] papaul: thanks as well [14:23:03] hi all, just a heads up: I plan to disable puppet at 15:00 to kick off the puppetdb maintenance work [14:23:25] thanks jbond [14:23:33] (03PS21) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:23:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717383 (owner: 10Volans) [14:25:02] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10dcaro) [14:25:22] (03CR) 10Jbond: [C: 03+1] "lgtm (of course when tests are fixed)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [14:26:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:27:34] yes [14:29:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:31] (03PS2) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) [14:32:37] (03PS22) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:32:39] (03PS12) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [14:32:41] (03PS11) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [14:35:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:36:04] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) I am in now. I can access those UI's and successfully ssh into stat1007. Thank you! [14:36:08] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:36:22] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [14:37:36] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Papaul) @Dzahn first let us swap A1 with B1 and see if we still have the error on A1. Memory swap complete and IDRAC upgrade from 2.50 to 2.80. I will leave the task open for now until next week. thanks [14:38:18] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (10Papaul) [14:38:28] (03PS13) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [14:38:59] (03PS3) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) [14:39:30] PROBLEM - mediawiki-installation DSH group on mw2264 is CRITICAL: Host mw2264 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:39:39] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) 05Open→03Resolved [14:40:23] (03Abandoned) 10Hashar: allow useful Jenkins URLs [puppet] - 10https://gerrit.wikimedia.org/r/629417 (https://phabricator.wikimedia.org/T178458) (owner: 10CDanis) [14:40:29] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:41:15] (03Abandoned) 10Hashar: gerrit: Add option to enable developer auth [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox) [14:41:17] (03PS12) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [14:43:43] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile
[puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:44:04] (03PS4) 10JMeybohm: echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 [14:46:16] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:46:55] (03CR) 10Ahmon Dancy: [C: 03+1] "Waiting for Jeena's approval as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm) [14:48:16] (03PS1) 10Urbanecm: foundationwiki: Restrict editing of sensitive namespaces to `editor` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) [14:48:53] (03CR) 10Urbanecm: [C: 04-2] "do not merge (yet), pending adding members to the wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm) [14:50:26] (03PS3) 10BryanDavis: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm) [14:51:53] (03PS1) 10Elukey: knative-serving: improve the helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835) [14:52:29] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod, docs only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491) (owner: 10Zabe) [14:52:48] (03PS13) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [14:53:38] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.014 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:53:52] (03CR) 10Elukey: "This seems to work on deploy1002 (had to test in there because I wasn't sure what worked and what not)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:54:07] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:55:21] (03CR) 10Urbanecm: Growth: Remove config that moved on-wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [14:55:38] (03Merged) 10jenkins-bot: Typo fix: 'the the' -> 'the' [deployment-charts] - 10https://gerrit.wikimedia.org/r/717067 (https://phabricator.wikimedia.org/T201491) (owner: 10Zabe) [14:56:22] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 7.352 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:56:22] 10ops-codfw, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) [14:56:38] 10ops-codfw, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) p:05Triage→03Medium [14:56:44] (03CR) 10Elukey: [C: 03+2] "As always, going to merge and test this properly. Please lemme know anything weird :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717438 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:58:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:58:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:18] (03PS14) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [14:59:29] 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10jcrespo) [15:00:38] !log disable puppet fleet wide to perform puppetdb database maintenance - T263578 [15:00:39] 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) @Papaul let's do that next week, which day/time would work for you? [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [15:12:41] 10ops-codfw, 10DBA, 10Data-Persistence: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) @Marostegui will confirm next week with day and time. Thanks. 
[15:14:17] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112 (10Papaul) 05Open→03Resolved Complete [15:16:34] PROBLEM - Host puppetdb1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:44] (03PS1) 10Cwhite: logstash: alertmanager: add alertname and summary to labels [puppet] - 10https://gerrit.wikimedia.org/r/717441 (https://phabricator.wikimedia.org/T289356) [15:17:00] PROBLEM - Host puppetdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:20] !log create lvm snapshot puppetdb1002_data_snapshot on ganeti1012 - T263578 [15:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:26] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [15:20:38] RECOVERY - Host puppetdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [15:21:04] !log create lvm snapshot puppetdb2002_data_snapshot on ganeti2023 - T263578 [15:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:01] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) created snapshots as a rollback strategy post vacuum ` name=ganeti1012 $ sudo lvdisplay ganeti/puppetdb1002_data_snapshot... 
[15:22:09] (03PS1) 10Cwhite: logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356) [15:22:26] RECOVERY - Host puppetdb2002 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms [15:23:42] PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:06] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:19] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) `VACUUM FULL VERBOSE ANALYZE; ` is running on puppetdb1002 in a tmux session under my user [15:32:00] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) [15:32:39] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) The commands are hanging to disconnect this server from the cluster, so I have to reboot it in order to break the link. I've downtimed it in... [15:32:58] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.052 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:36:02] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) That was successful: ` root@labstore1004:~# drbd-overview 1:test/0 StandAlone Primary/Unknown UpToDate/DUnknown /srv/test ext4 9.8G 535M... 
[15:36:40] (03CR) 10Herron: rsyslog: stop saving trafficserver-tls logs to disk (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [15:40:38] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:41:11] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) The urgency is moderate here. The NFS service for the cloud (a core function of #toolforge ) is still up and should function fine. This is th... [15:42:08] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:42:26] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:43:58] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 3.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:44:04] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:44:22] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:44:28] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:46:17] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) >>! In T263578#7331105, @jbond wrote: > `VACUUM FULL VERBOSE ANALYZE; ` is running on pupetdb1002 in a tmux session under my user Well t... [15:49:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:50:19] (03PS1) 10Herron: thanos::rule: add cluster_site:sli_etcd_http_error_ratio:rate5m recording rule [puppet] - 10https://gerrit.wikimedia.org/r/717473 (https://phabricator.wikimedia.org/T289615) [15:51:53] (03PS1) 10Volans: sre.hosts.decommission: catch unhandled exception [cookbooks] - 10https://gerrit.wikimedia.org/r/717475 (https://phabricator.wikimedia.org/T290326) [15:53:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:28] !log enable puppet fleet wide to post puppetdb database maintance - T263578 [15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:32] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [15:56:00] (03CR) 10Cwhite: [C: 03+2] logstash: alertmanager: add alertname and summary to labels [puppet] - 10https://gerrit.wikimedia.org/r/717441 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [15:57:02] (03PS2) 
10Cwhite: logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356) [15:57:28] (03PS4) 10BryanDavis: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm) [16:02:44] (03PS1) 10Elukey: istio: change node port for HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/717482 (https://phabricator.wikimedia.org/T289835) [16:03:03] (03CR) 10BryanDavis: [C: 03+2] toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm) [16:06:22] (03Merged) 10jenkins-bot: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) (owner: 10Legoktm) [16:06:42] (03CR) 10Elukey: [C: 03+2] istio: change node port for HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/717482 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [16:08:12] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483 [16:10:13] !log blazegraph (public cofdfw cluster) will now restart every hour - T290330 [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:18] T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330 [16:10:20] (03CR) 10ZPapierski: [C: 03+1] query service: Fix loading of DCATAP file [puppet] - 10https://gerrit.wikimedia.org/r/715696 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [16:14:04] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:55] (03PS1) 10Cwhite: logstash: route aqs and restbase 
logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565) [16:18:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:17] (03PS1) 10Cwhite: logstash: route gitlab logs to default indexes [puppet] - 10https://gerrit.wikimedia.org/r/717490 (https://phabricator.wikimedia.org/T274462) [16:19:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:19:02] (03PS2) 10Cwhite: logstash: route aqs and restbase logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565) [16:19:06] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:19:06] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:19:18] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:25] (03PS3) 10Volans: sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) [16:20:28] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:23:57] (03CR) 10Cwhite: [C: 03+2] logstash: route aqs and restbase logs to default ecs indexes [puppet] - 10https://gerrit.wikimedia.org/r/717489 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:24:11] (03CR) 10Cwhite: 
[C: 03+2] logstash: route gitlab logs to default indexes [puppet] - 10https://gerrit.wikimedia.org/r/717490 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [16:26:20] (03CR) 10Volans: "addressed comments, this implies already the changes made in I8357ef4524bc3841cd45126c51479daf60f50cc2" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [16:29:46] (03PS1) 10Jgreen: Re-enable icinga monitoring on payments1008, adding check_ssl_staging [puppet] - 10https://gerrit.wikimedia.org/r/717492 (https://phabricator.wikimedia.org/T289869) [16:32:41] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:41] (03PS1) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) [16:35:33] (03CR) 10Jforrester: "I imagine we'd also want to restrict the file and template namespaces, given the impact that edits there will have on the content?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm) [16:36:36] (03CR) 10Jgreen: [C: 03+2] Re-enable icinga monitoring on payments1008, adding check_ssl_staging [puppet] - 10https://gerrit.wikimedia.org/r/717492 (https://phabricator.wikimedia.org/T289869) (owner: 10Jgreen) [16:37:08] (03PS2) 10Urbanecm: foundationwiki: Restrict editing of sensitive namespaces to `editor` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) [16:40:29] (03PS2) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) [16:40:44] (03CR) 10Urbanecm: [C: 04-2] foundationwiki: Restrict editing of sensitive namespaces to `editor` group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717437 (https://phabricator.wikimedia.org/T205350) (owner: 10Urbanecm) [16:41:02] (03CR) 10jerkins-bot: [V: 04-1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:42:01] (03PS3) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) [16:42:29] (03CR) 10jerkins-bot: [V: 04-1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:42:38] (03PS4) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) [16:45:48] (03PS5) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) [16:46:29] (03CR) 10Ryan Kemper: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:47:02] (03CR) 10ZPapierski: [C: 03+1] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:47:36] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:48:51] (03PS1) 10Urbanecm: foundationwiki: Restrict uploading to editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717497 (https://phabricator.wikimedia.org/T205350) [16:51:07] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483 (owner: 10PipelineBot) [16:53:41] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717483 (owner: 10PipelineBot) [16:55:30] (03CR) 10Ryan Kemper: "Small issue with randomizedsecdelay syntax, reverting to fix" [puppet] - 10https://gerrit.wikimedia.org/r/717494 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [16:55:47] (03PS1) 10Ryan Kemper: Revert "wdqs: temp mitigation => restart hourly w random" [puppet] - 10https://gerrit.wikimedia.org/r/717451 [16:57:17] (03CR) 10Ryan Kemper: [C: 03+2] Revert "wdqs: temp mitigation => restart hourly w random" [puppet] - 10https://gerrit.wikimedia.org/r/717451 (owner: 10Ryan Kemper) [16:57:46] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [17:04:57] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) I think we're probably not doing this for now - please reopen if you feel strongly! 
[17:05:07] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) 05Open→03Declined [17:09:19] (03PS1) 10Urbanecm: [WIP] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) [17:09:36] (03CR) 10Urbanecm: [C: 04-2] "do not merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [17:09:56] (03CR) 10Majavah: [C: 04-1] [WIP] Connect foundationwiki to SUL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [17:10:30] (03PS2) 10Urbanecm: [WIP] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) [17:10:43] (03PS1) 10Ryan Kemper: wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717508 (https://phabricator.wikimedia.org/T290330) [17:10:58] (03CR) 10Urbanecm: [C: 04-2] [WIP] Connect foundationwiki to SUL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [17:12:22] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: temp mitigation => restart hourly w random [puppet] - 10https://gerrit.wikimedia.org/r/717508 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [17:17:17] (03PS1) 10Jgreen: remove deprecated payments.frdev.wikimedia.org A record [dns] - 10https://gerrit.wikimedia.org/r/717510 [17:17:55] !log T290330 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/717508 across `wdqs` fleet; codfw wdqs hosts will restart on average once per hour now to address ongoing availability issues for wdqs codfw [17:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:01] T290330: Wikidata Query Service unstable in 
codfw - https://phabricator.wikimedia.org/T290330 [17:18:11] (03PS1) 10Urbanecm: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347) [17:20:22] (03CR) 10Jgreen: [V: 03+2 C: 03+2] remove deprecated payments.frdev.wikimedia.org A record [dns] - 10https://gerrit.wikimedia.org/r/717510 (owner: 10Jgreen) [17:21:15] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:15] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:23] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:29] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:41] (03PS2) 10Urbanecm: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347) [17:21:47] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:25] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:37] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: 
CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:39] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:39] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:39] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:43] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:54] ^ These are cropping up as an implementation detail, sorry for the noise, fixing now [17:23:05] (Silencing wdqs1* briefly in the meantime) [17:30:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:31:01] (03PS1) 10Ryan Kemper: wdqs: return 0 exit code for non-codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/717512 [17:32:41] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: return 0 exit code for non-codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/717512 (owner: 10Ryan Kemper) [17:35:09] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:36] !log 
dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [17:35:39] RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:45] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:01] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:43] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:57] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:57] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:57] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:01] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:19] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:21] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:37] RECOVERY - Work requests waiting in Zuul Gearman server 
on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:40:22] !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [17:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:20] !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [17:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:27] (03CR) 10Bstorm: [C: 03+1] "This seems like a serious win for smaller images." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/717422 (owner: 10Majavah) [18:16:17] (03CR) 10Jeena Huneidi: [C: 04-1] "This appears to cause a release that is stuck in install/upgrade/rollback mode if the job doesn't succeed. There's no way to modify the re" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717066 (owner: 10JMeybohm) [18:24:54] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10LSobanski) [18:28:25] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) 05Resolved→03Open Hi again! One further question for you all; does @JMando have access to jupyter? The command `ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880` seems to open a... 
[18:43:43] (03CR) 10Jbond: "LGTM, not tested but i think its fine to merge then test" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [18:43:48] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [19:01:10] (03Restored) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [19:01:21] (03PS2) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) [19:03:22] (03CR) 10Nikki Nikkhoui: "@Cole would this work for my service in Cloud VPS not in deployment-prep?" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [19:04:43] !log T290330 `ryankemper@cumin1001:~$ sudo -E cumin 'P{wdqs2*}' 'sudo rm -fv /etc/cron.hourly/restart-blazegraph'` (Cleaned up manually created crons now that we have [somewhat hacky] systemd timers doing the same job) [19:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:49] T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330 [19:06:19] (03PS1) 10BryanDavis: toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565 [19:10:50] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10Ottomata) http://localhost:8880 [19:13:55] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) [19:17:27] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1039.42 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:17:36] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565 (owner: 10BryanDavis) [19:20:40] (03Merged) 10jenkins-bot: toolhub: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/717565 (owner: 10BryanDavis) [19:23:06] (03CR) 10Cwhite: Add image suggestion api to lookup table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [19:26:45] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [19:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:12] !log krinkle@deploy1002 Started deploy [integration/docroot@6492b3d]: I48480e89e5f6 [19:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:22] !log krinkle@deploy1002 Finished deploy [integration/docroot@6492b3d]: I48480e89e5f6 (duration: 00m 10s) [19:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:16] (03PS3) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) [19:38:51] (03CR) 10Nikki Nikkhoui: "ok" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [19:45:27] (03CR) 10Cwhite: [C: 03+1] "Whether here in the lookup table or writing a rule in 20-trafficserver.conf, I don't feel strongly." 
[puppet] - 10https://gerrit.wikimedia.org/r/717311 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [19:50:04] (03PS1) 10Herron: set default slo field values and remove duplicates [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717584 [19:50:27] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [19:52:06] (03CR) 10Cwhite: [C: 03+1] facter networking: filter out cali/tap interfaces [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [19:52:46] (03Abandoned) 10Herron: logstash: route alertmanager alerts to logstash alerts index [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron) [20:03:39] (03PS1) 10BryanDavis: toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586 [20:06:49] (03PS1) 10Herron: slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) [20:11:05] (03CR) 10Nikki Nikkhoui: "Is there someone else that i could add to the patch for a +2? Or do you allow self-merging patches?" 
[puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [20:11:37] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586 (owner: 10BryanDavis) [20:11:49] (03CR) 10Herron: "here is the varnish dashboard preview for this patch https://grafana.wikimedia.org/dashboard/snapshot/rAagzkSklHmPEZ4qBlHsiaIU0FBTHgdH?org" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [20:13:14] (03CR) 10Cwhite: [C: 03+2] Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [20:17:34] (03PS1) 10Mholloway: Convert $wgEventStreams to be an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) [20:18:30] (03Merged) 10jenkins-bot: toolhub: bump container version to 2021-09-03-195018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/717586 (owner: 10BryanDavis) [20:19:01] (03CR) 10Cwhite: [C: 03+1] "Nit inline, but otherwise LGTM" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [20:27:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:30:07] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[20:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] (03CR) 10Ottomata: Convert $wgEventStreams to be an associative array (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [20:51:24] (03PS1) 10BryanDavis: toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595 [20:56:21] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) I am able to access with http://localhost:8880. Thank you! [20:56:45] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) 05Open→03Resolved Ah, nice. I'll go get my vision checked and close out this task. Thank you for correcting my numerous typos during this setup process. [20:56:46] (03PS1) 10Legoktm: nodejs-devel: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717598 (https://phabricator.wikimedia.org/T290209) [20:57:47] (03CR) 10Legoktm: [V: 03+2 C: 03+2] nodejs-devel: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717598 (https://phabricator.wikimedia.org/T290209) (owner: 10Legoktm) [21:03:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:05:52] (03CR) 10Bstorm: [C: 03+2] "I'm going to merge this one with the understanding that if anyone moves to release, we need to nag me to run my script.
It looks like ther" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm) [21:06:55] (03Merged) 10jenkins-bot: Use common k8s labels [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm) [21:20:11] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:22:53] (03PS4) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) [21:22:55] (03CR) 10Ebernhardson: Add wcqs.svc.{codfw,eqiad}.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/713929 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [21:29:02] (03PS1) 10Ahmon Dancy: check_binary: Improve error message [deployment-charts] - 10https://gerrit.wikimedia.org/r/717605 [21:31:51] (03PS1) 10Ebernhardson: Add cname for commons-query.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/717606 (https://phabricator.wikimedia.org/T282117) [21:34:47] (03CR) 10BryanDavis: [C: 03+2] toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595 (owner: 10BryanDavis) [21:37:58] (03Merged) 10jenkins-bot: toolhub: Disable crawler job in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/717595 (owner: 10BryanDavis) [21:39:11] (03PS1) 10Ahmon Dancy: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 [21:40:47] (03CR) 10Ahmon Dancy: "This is an alternate approach to solving the problem described in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/717066" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [21:44:53] (03PS5) 10Ebernhardson: query_service: support 
multiple variants of wdqs microsite [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) [21:44:55] (03PS1) 10Ebernhardson: Deploy query_service microsite for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/717630 [21:49:46] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [21:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:56] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717644 [22:02:34] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717648 [22:05:57] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/717650 [22:07:03] (03PS1) 10Krinkle: clinic-duty: Misc JS clean ups [software] - 10https://gerrit.wikimedia.org/r/717651 [22:10:51] (03CR) 10Jeena Huneidi: [C: 03+1] "This is working without causing helm to be stuck doing a release/upgrade for me. 
Unfortunately when I run rake I get an error, but it's no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [22:12:36] 10SRE-Access-Requests, 10Release-Engineering-Team: Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy) [22:12:56] 10SRE-Access-Requests, 10Release-Engineering-Team: Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10dancy) [22:12:59] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) [22:13:18] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) 05Stalled→03Open [22:14:36] (03CR) 10Ahmon Dancy: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [22:15:41] (03PS2) 10Krinkle: clinic-duty: Misc JS clean ups [software] - 10https://gerrit.wikimedia.org/r/717651 [22:21:03] (03PS1) 10Krinkle: clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 [22:29:59] 10ops-codfw, 10DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T290362 (10wiki_willy) [22:30:12] 10ops-codfw, 10DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T290362 (10wiki_willy) [22:39:14] 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10wiki_willy) [22:39:38] 10ops-eqiad, 10DC-Ops: Netbox Errors in eqiad - https://phabricator.wikimedia.org/T290364 (10wiki_willy) [22:41:09] 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (10dduvall) 
05Open→03Resolved a:03dduvall Resolving since the docs are now up-to-date. [22:45:01] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10wiki_willy) 05Open→03Resolved a:03wiki_willy Resolving this task. After talking to Chris, we'll update the eqiad inventory after the next recycling pickup in a... [22:49:38] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10Sj) I was just dreaming of having such a backup off-cluster. Let me know when there is a half-PB of files that I can host. (Ma... [23:02:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (10wiki_willy) [23:03:12] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10wiki_willy) [23:09:12] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10wiki_willy) After talking to John, the ETA to start on this is in a couple weeks - mid September [23:34:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:48] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10Tgr) GerritRobotComments seems to... 
[23:37:11] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:38:05] (03PS1) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) [23:38:21] (03PS2) 10Krinkle: ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) [23:40:08] (03CR) 10Legoktm: [C: 03+1] "Thanks, all LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716553 (owner: 10BryanDavis) [23:42:07] (03CR) 10Legoktm: [C: 04-1] "Why not ensure => absent /var/lib/mailman/templates and /etc/mailman in mailman::webui too?" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:44:27] (03PS2) 10Ladsgroup: mailman: Drop listinfo files [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) [23:44:46] (03CR) 10Legoktm: [C: 03+1] Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [23:45:01] (03CR) 10Ladsgroup: mailman: Drop listinfo files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:45:08] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:46:08] (03PS3) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) [23:50:25] (03CR) 10Krinkle: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not 
find resource 'Package[docker-ce]' in p" [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [23:50:43] Amir1: would you happen to know what's up with puppet ^ [23:51:00] let me take a look [23:51:42] (03CR) 10Legoktm: [C: 03+1] "Remind me to merge this on Tuesday in case I forget" [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:52:03] I've cherry-picked it onto integration-puppetmaster-02 [23:52:18] and then run puppet agent -tv on the new qemu-1002 instance [23:54:20] Krinkle: the line that errors is 54 [23:54:26] which is hard-coding docker-ce [23:55:07] I assume this should be skipped somehow? [23:55:14] oh down there [23:55:18] or I'm misunderstanding you [23:55:27] Yes, no, you're absolutely right [23:55:29] this should be simple [23:56:22] (03PS4) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) [23:56:30] happy to be a rubber duck :D [23:56:57] (03CR) 10Ladsgroup: [C: 03+1] ci: Add 'bulleye' to docker lsbdistcodename hack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [23:57:33] not yet :) [23:58:12] (03PS5) 10Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) [23:58:52] now? :D [23:59:09] Amir1: maybe in a few days [23:59:24] cool [23:59:25] This is a lot of new stuff I'm hacking together first. [23:59:31] might change my mind etc.