[00:00:25] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:57] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1010.eqiad.wmnet --dest wdqs1009.eqiad.wmnet --reason "transferring skolemized wikidata.jnl so we can reimage wdqs1009" --blazegraph_instance blazegraph --without-lvs` on `ryankemper@cumin1001` tmux session `wdqs_1009`
[00:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:01] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:09:10] PROBLEM - SSH on ores2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:43] random codfw mgmt only
[00:30:56] (PS1) Dzahn: add chart for miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538)
[00:33:06] (CR) Catrope: [C: +2] Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898) (owner: Samwilson)
[00:33:51] (Merged) jenkins-bot: Enable Wikisource OCR on select Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/698654 (https://phabricator.wikimedia.org/T283898) (owner: Samwilson)
[00:37:18] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:698654|Enable Wikisource OCR on select Wikisources (T283898)]] (duration: 01m 31s)
[00:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:23] T283898: Turning on MVP on certain Wikis - https://phabricator.wikimedia.org/T283898
[00:46:24] RECOVERY - MariaDB Replica Lag: pc2 on pc2008 is OK: OK slave_sql_lag Replication lag: 51.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:20:19] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[01:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:06] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:39:20] !log T280382 Re-enabled puppet on `wdqs1010`
[02:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:25] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:49:17] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:24] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1010.eqiad.wmnet --dest wdqs1009.eqiad.wmnet --reason "xfer categories following reimage" --blazegraph_instance categories --without-lvs` on `ryankemper@cumin1001` tmux session `wdqs_1009`
[02:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:28] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:55:55] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[02:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:43] !log clean up of the rest of mbox files (except arbcom) (T282303)
[02:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:47] T282303: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303
[02:57:05] 7GB. not too bad.
[02:58:49] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:17] !log mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=aawiktionary --site-group wiktionary (T284444)
[03:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:21] T284444: Mon Wiktionary not listed in "In other languages" sidebar section (interwiki panel) - https://phabricator.wikimedia.org/T284444
[03:14:54] SRE, Traffic, observability: Implement SLI measurement for Varnish Frontend - https://phabricator.wikimedia.org/T284576 (lmata) p:Triage→High
[03:21:36] SRE, Icinga, observability, serviceops: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (lmata) p:Medium→High Apologies i seem to have been confused. Scheduling for review.
[04:01:28] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:27:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:17] SRE, observability, Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (lmata) a:lmata
[04:42:13] (PS1) Marostegui: db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698907 (https://phabricator.wikimedia.org/T283235)
[04:43:37] (CR) Marostegui: [C: +2] db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698907 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[04:44:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 to remove rev_page_id index T163532', diff saved to https://phabricator.wikimedia.org/P16330 and previous config saved to /var/cache/conftool/dbconfig/20210609-044428-marostegui.json
[04:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:35] T163532: Drop index rev_page_id (rev_page, rev_id) - https://phabricator.wikimedia.org/T163532
[04:47:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Repool db1135 after dropping an index', diff saved to https://phabricator.wikimedia.org/P16331 and previous config saved to /var/cache/conftool/dbconfig/20210609-044703-root.json
[04:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:14] (PS1) Marostegui: install_server: Do not reimage db1182 [puppet] - https://gerrit.wikimedia.org/r/698908
[04:50:09] (CR) Marostegui: [C: +2] install_server: Do not reimage db1182 [puppet] - https://gerrit.wikimedia.org/r/698908 (owner: Marostegui)
[05:02:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: Repool db1135 after dropping an index', diff saved to https://phabricator.wikimedia.org/P16332 and previous config saved to /var/cache/conftool/dbconfig/20210609-050206-root.json
[05:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Repool db1135 after dropping an index', diff saved to https://phabricator.wikimedia.org/P16333 and previous config saved to /var/cache/conftool/dbconfig/20210609-051710-root.json
[05:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:29] (PS7) Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829
[05:19:52] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:15] (PS8) Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829
[05:26:42] (PS5) Effie Mouzeli: kubernetes::deployment_server: create a separate mediawiki profile [puppet] - https://gerrit.wikimedia.org/r/698795
[05:32:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Repool db1135 after dropping an index', diff saved to https://phabricator.wikimedia.org/P16334 and previous config saved to /var/cache/conftool/dbconfig/20210609-053213-root.json
[05:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:28] (PS9) Effie Mouzeli: (WIP) add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829
[05:44:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:56] (CR) Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/29834/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/698829 (owner: Effie Mouzeli)
[06:00:48] (PS2) Giuseppe Lavagetto: Fix undefined variables in Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/698813
[06:00:50] (PS1) Giuseppe Lavagetto: mwdebug: use the newest mediawiki image [deployment-charts] - https://gerrit.wikimedia.org/r/698911
[06:03:00] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:03:54] (CR) Giuseppe Lavagetto: [C: +2] Fix undefined variables in Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/698813 (owner: Giuseppe Lavagetto)
[06:06:30] (Merged) jenkins-bot: Fix undefined variables in Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/698813 (owner: Giuseppe Lavagetto)
[06:10:46] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: use the newest mediawiki image [deployment-charts] - https://gerrit.wikimedia.org/r/698911 (owner: Giuseppe Lavagetto)
[06:12:43] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:42] (CR) Giuseppe Lavagetto: [C: +2] setup.py: change setuptools_scm tag regex (1 comment) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[06:16:36] (CR) Giuseppe Lavagetto: [C: +2] Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764 (owner: JMeybohm)
[06:17:51] (CR) jerkins-bot: [V: -1] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[06:17:53] (CR) jerkins-bot: [V: -1] Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764 (owner: JMeybohm)
[06:19:54] (CR) Giuseppe Lavagetto: [C: +2] Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765 (owner: JMeybohm)
[06:22:07] (CR) jerkins-bot: [V: -1] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[06:22:09] (CR) jerkins-bot: [V: -1] Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764 (owner: JMeybohm)
[06:22:11] (CR) jerkins-bot: [V: -1] Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765 (owner: JMeybohm)
[06:25:39] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/29835/" [puppet] - https://gerrit.wikimedia.org/r/698206 (https://phabricator.wikimedia.org/T252132) (owner: Ayounsi)
[06:25:59] !log Add 185.71.138.0/24 to network::external and diffscan - T252132
[06:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:04] T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132
[06:26:07] (CR) Giuseppe Lavagetto: [C: +2] "LGTM, although I'm starting to think we might want to have a response data structure instead of a tuple of lists." (1 comment) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698766 (owner: JMeybohm)
[06:27:56] (CR) jerkins-bot: [V: -1] setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[06:27:58] (CR) jerkins-bot: [V: -1] Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764 (owner: JMeybohm)
[06:28:00] (CR) jerkins-bot: [V: -1] Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765 (owner: JMeybohm)
[06:28:02] (CR) jerkins-bot: [V: -1] Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698766 (owner: JMeybohm)
[06:35:09] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (elukey) Resolved→Open @Papaul the host is down again :( ` racadm>>racadm getsel Record: 1 Date/Time: Source:...
[06:48:38] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:15:11] (CR) Muehlenhoff: profile::contacts: add a profile and define for adding contact metadata (1 comment) [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[07:16:52] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:50] (PS3) JMeybohm: setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[07:24:52] (PS3) JMeybohm: Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764
[07:24:54] (PS3) JMeybohm: Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765
[07:24:57] (PS3) JMeybohm: Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698766
[07:24:58] (PS2) JMeybohm: Relase new version 0.0.12-1 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698769
[07:25:00] (PS1) JMeybohm: Instsall missing type stubs for mypy [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698943
[07:25:08] (PS2) JMeybohm: Install missing type stubs for mypy [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698943
[07:25:10] (PS4) JMeybohm: setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[07:25:12] (PS4) JMeybohm: Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764
[07:25:14] (PS4) JMeybohm: Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765
[07:25:16] (PS4) JMeybohm: Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698766
[07:25:18] (PS3) JMeybohm: Relase new version 0.0.12-1 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698769
[07:27:28] (PS10) Effie Mouzeli: mcrouter: add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829 (https://phabricator.wikimedia.org/T284420)
[07:32:13] (CR) Muehlenhoff: "But I'm wondering why this patch is still needed, though: Janis removed the images in question in" [puppet] - https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[07:42:15] SRE, SRE-Access-Requests: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (Joe) Reading both here and the backlog in slack, I agree with @rzl: you did set a passphrase for your private key, then memorized it in Keychain. Now you'd need to type it in once again and it would be s...
[07:54:16] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:39] (CR) David Caro: ceph: add cookbooks to reboot osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: David Caro)
[07:55:59] (PS3) Muehlenhoff: Remove unused legacy service aliases [puppet] - https://gerrit.wikimedia.org/r/689857
[07:58:59] (CR) Muehlenhoff: [C: +2] Remove unused legacy service aliases [puppet] - https://gerrit.wikimedia.org/r/689857 (owner: Muehlenhoff)
[08:01:06] (PS2) David Caro: ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248)
[08:12:05] (CR) Jbond: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[08:13:13] SRE, SRE-Access-Requests: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (Volans) p:Triage→Medium
[08:15:01] SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond) @JMeybohm [[ https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/608889 | the patch to filter ]] still required considering https...
[08:15:58] SRE, SRE-Access-Requests, Patch-For-Review: Add ssh key for jforrester - https://phabricator.wikimedia.org/T284613 (Volans) p:Triage→Medium
[08:16:03] SRE, SRE-Access-Requests, Patch-For-Review: Add ssh key for jforrester - https://phabricator.wikimedia.org/T284613 (Volans) In an abundance of caution (also because the key is not included in the signed message ;) ) I'll double check with @Jdforrester-WMF when back online before merging it.
[08:19:48] (PS1) ZPapierski: Enable blank node skolemization for WCQS [puppet] - https://gerrit.wikimedia.org/r/698949 (https://phabricator.wikimedia.org/T284040)
[08:20:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:32] (CR) Gehel: [C: +2] Enable blank node skolemization for WCQS [puppet] - https://gerrit.wikimedia.org/r/698949 (https://phabricator.wikimedia.org/T284040) (owner: ZPapierski)
[08:21:58] (PS19) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[08:22:46] (PS1) Volans: admin: add jgianellos and mbsantos to maps-roots [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135)
[08:22:50] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29836/console" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[08:24:13] (Abandoned) Effie Mouzeli: mediawiki: Fix nutcracker port [deployment-charts] - https://gerrit.wikimedia.org/r/697599 (owner: Effie Mouzeli)
[08:25:44] (PS2) Effie Mouzeli: Add redis password for mw:nutcracker:redis_password [labs/private] - https://gerrit.wikimedia.org/r/696309
[08:27:43] (CR) Jbond: [V: +1] "> Patch Set 18:" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[08:30:06] (PS3) Elukey: [WIP] - Add the operators.d directory with basic Istio config [deployment-charts] - https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192)
[08:30:19] (CR) Muehlenhoff: admin: add jgianellos and mbsantos to maps-roots (2 comments) [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[08:30:51] (CR) Elukey: [WIP] - Add the operators.d directory with basic Istio config (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:33:49] eqiad-esams...
[08:34:30] unplanned outage
[08:34:34] again
[08:35:25] (PS2) Volans: admin: add jgianellos and mbsantos to maps-roots [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135)
[08:35:33] (CR) Volans: "addressed comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[08:36:40] (CR) Jbond: [V: +1] "> Patch Set 18:" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[08:36:57] (CR) Giuseppe Lavagetto: [C: -1] "There is an error - the proxies would never be added to the yaml output - but overall LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/698829 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[08:37:08] SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) >>! In T251918#7144893, @jbond wrote: > @JMeybohm [[ https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/608889 | the patch to...
[08:37:45] (CR) Muehlenhoff: [C: +1] "Looks good to me, one more nit inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[08:38:54] (PS3) Volans: admin: add jgianellos and mbsantos to maps-roots [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135)
[08:38:59] (CR) Volans: "done" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[08:43:03] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[08:43:25] (CR) Giuseppe Lavagetto: [C: +1] "I misread the patch, LGTM." [puppet] - https://gerrit.wikimedia.org/r/698829 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[08:44:59] (PS1) Jelto: icinga: let Jelto Wodstrcil (Jelto) run commands on all hosts and services [puppet] - https://gerrit.wikimedia.org/r/698952
[08:49:08] (CR) Giuseppe Lavagetto: [C: +1] kubernetes::deployment_server: create a separate mediawiki profile [puppet] - https://gerrit.wikimedia.org/r/698795 (owner: Effie Mouzeli)
[08:50:18] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/698855 (https://phabricator.wikimedia.org/T284627) (owner: DannyS712)
[08:51:39] (PS20) Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088)
[08:52:42] SRE, Citoid, VisualEditor, Services (done): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304 (hashar)
[08:53:42] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/698952 (owner: Jelto)
[08:53:56] SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond) linked the wrong CR earlier i meant https://gerrit.wikimedia.org/r/698763, however assuming you saw past my error ill revert that now, thanks :)
[08:54:38] (PS1) Jbond: Revert "P:docker: update filter file" [puppet] - https://gerrit.wikimedia.org/r/698856
[08:55:42] (CR) Effie Mouzeli: [C: +2] kubernetes::deployment_server: create a separate mediawiki profile [puppet] - https://gerrit.wikimedia.org/r/698795 (owner: Effie Mouzeli)
[08:56:58] (CR) Muehlenhoff: [C: +2] eventschemas: Switch to nginx profile [puppet] - https://gerrit.wikimedia.org/r/698767 (https://phabricator.wikimedia.org/T163356) (owner: Muehlenhoff)
[08:57:48] (CR) Effie Mouzeli: [C: +2] mcrouter: add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[08:58:00] (PS11) Effie Mouzeli: mcrouter: add mcrouter pools to deployment servers [puppet] - https://gerrit.wikimedia.org/r/698829 (https://phabricator.wikimedia.org/T284420)
[08:58:25] (PS20) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[08:58:37] (PS21) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[08:58:53] (CR) Ayounsi: [C: +1] (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[08:59:22] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29838/console" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[08:59:54] (CR) JMeybohm: [C: +2] Relase new version 0.0.12-1 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698769 (owner: JMeybohm)
[09:00:05] (CR) JMeybohm: [C: +2] Install missing type stubs for mypy [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698943 (owner: JMeybohm)
[09:02:09] (Merged) jenkins-bot: Install missing type stubs for mypy [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698943 (owner: JMeybohm)
[09:02:26] (Merged) jenkins-bot: setup.py: change setuptools_scm tag regex [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698740 (owner: Volans)
[09:02:27] (CR) Volans: [C: +2] admin: add jgianellos and mbsantos to maps-roots [puppet] - https://gerrit.wikimedia.org/r/698950 (https://phabricator.wikimedia.org/T284135) (owner: Volans)
[09:02:47] (Merged) jenkins-bot: Fix output of registry on delete-tags [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698764 (owner: JMeybohm)
[09:04:26] (Merged) jenkins-bot: Remove duplicate log line from Chartmuseum class [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698765 (owner: JMeybohm)
[09:04:28] (Merged) jenkins-bot: Don't treat nonexisting image tags as failure on delete [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698766 (owner: JMeybohm)
[09:04:30] (Merged) jenkins-bot: Relase new version 0.0.12-1 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/698769 (owner: JMeybohm)
[09:07:31] (CR) Jbond: profile::contacts: add a profile and define for adding contact metadata (1 comment) [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[09:09:17] (PS22) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[09:09:34] (PS6) Jbond: concat: Add puppetlabs-concat module [puppet] - https://gerrit.wikimedia.org/r/696380
[09:09:41] (PS21) Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088)
[09:09:50] (PS23) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[09:10:11] (PS1) Effie Mouzeli: mcrouter::yaml_defs: fix yaml structure [puppet] - https://gerrit.wikimedia.org/r/698953
[09:10:34] SRE, SRE-Access-Requests, Patch-For-Review: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (Volans) @Jgiannelos @MSantos the patch has been merged and deployed, it will take effect on all hosts within the next 30 minutes. After that please confirm that you...
[09:10:45] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29839/console" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[09:12:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:25] (PS22) Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088)
[09:13:33] (PS24) Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088)
[09:13:56] (PS3) Effie Mouzeli: Add redis password for mw:nutcracker:redis_password [labs/private] - https://gerrit.wikimedia.org/r/696309
[09:14:22] (CR) Effie Mouzeli: [C: +2] Add redis password for mw:nutcracker:redis_password [labs/private] - https://gerrit.wikimedia.org/r/696309 (owner: Effie Mouzeli)
[09:14:30] (CR) Effie Mouzeli: [V: +2 C: +2] Add redis password for mw:nutcracker:redis_password [labs/private] - https://gerrit.wikimedia.org/r/696309 (owner: Effie Mouzeli)
[09:14:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:01] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29841/console" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[09:15:05] (CR) Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/compiler1002/29840/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/698953 (owner: Effie Mouzeli)
[09:15:08] (CR) Effie Mouzeli: [C: +2] mcrouter::yaml_defs: fix yaml structure [puppet] - https://gerrit.wikimedia.org/r/698953 (owner: Effie Mouzeli)
[09:15:21] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698771 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[09:19:18] (CR) Jbond: [V: +1] "> ill update that in the next PS" [puppet] - https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: Jbond)
[09:19:39] (CR) Jbond: [C: +2] Revert "P:docker: update filter file" [puppet] - https://gerrit.wikimedia.org/r/698856 (owner: Jbond)
[09:21:59] SRE, ops-codfw, ops-eqiad, DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (MoritzMuehlenhoff)
[09:22:03] (PS1) Jbond: Revert "docker-reporter: filter out old removed images" [puppet] - https://gerrit.wikimedia.org/r/698857
[09:23:20] (CR) Jelto: [C: +2] icinga: let Jelto Wodstrcil (Jelto) run commands on all hosts and services [puppet] - https://gerrit.wikimedia.org/r/698952 (owner: Jelto)
[09:23:54] (CR) Jbond: [C: +2] Revert "docker-reporter: filter out old removed images" [puppet] - https://gerrit.wikimedia.org/r/698857 (owner: Jbond)
[09:24:33] jelto: is it ok to merge your icinga change?
[09:27:33] jbond: it's fine to merge my icinga changes, thank you [09:28:04] merged [09:28:17] jelto: you will need to run puppet on icinga for it to take effect [09:31:22] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:37:12] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.97 ms [09:37:33] it was pinging already when I tried from alert1001 [09:37:38] (03CR) 10Muehlenhoff: [C: 03+2] eventschemas: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698771 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:53:47] (03CR) 10Volans: [C: 03+1] "LGTM, just missing tests." (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [09:54:14] (03CR) 10Jbond: [C: 03+2] P:netbox: Add support for cas authentication provider [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [09:54:19] (03CR) 10Jbond: [C: 03+2] O:netbox::standalone: switch netbox-next to use cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/698807 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [09:58:08] !log cleanup now unused nginx mods and former deps (various X11 libs and libxslt) on schema* after switch towards nginx-light T164456 [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:13] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [09:59:27] (03PS1) 10Kormat: mariadb: Disable sustained repl lag alerts for parsercache [puppet] - 10https://gerrit.wikimedia.org/r/698954 [09:59:53] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [09:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:02] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:00:34] (03CR)
10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29842/console" [puppet] - 10https://gerrit.wikimedia.org/r/698954 (owner: 10Kormat) [10:00:42] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 48s) [10:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:57] XioNoX, topranks: it seems that mr1-ulsfo.oob IPv6 is flapping (second time it fails), can't ping from alert1001 [10:01:24] I'm doing a traceroute6 [10:01:27] Let me have a look. [10:02:36] thx, paste sent in query [10:03:32] (03CR) 10Marostegui: [C: 03+1] mariadb: Disable sustained repl lag alerts for parsercache [puppet] - 10https://gerrit.wikimedia.org/r/698954 (owner: 10Kormat) [10:03:49] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Disable sustained repl lag alerts for parsercache [puppet] - 10https://gerrit.wikimedia.org/r/698954 (owner: 10Kormat) [10:04:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 T283235', diff saved to https://phabricator.wikimedia.org/P16337 and previous config saved to /var/cache/conftool/dbconfig/20210609-100423-marostegui.json [10:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:28] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 [10:05:27] (03PS1) 10Marostegui: db1130: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698955 (https://phabricator.wikimedia.org/T283235) [10:05:32] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 72.97 ms [10:06:16] (03CR) 10Marostegui: [C: 03+2] db1130: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698955 (https://phabricator.wikimedia.org/T283235) (owner: 10Marostegui) [10:06:18] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [10:06:20]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:56] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 38s) [10:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:42] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [10:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:09] (03PS1) 10Jbond: apereo_cas: add base_url to type signiture [puppet] - 10https://gerrit.wikimedia.org/r/698958 [10:12:24] (03CR) 10Jbond: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/698958 was required to fix issues" [puppet] - 10https://gerrit.wikimedia.org/r/698796 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [10:13:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: add base_url to type signiture [puppet] - 10https://gerrit.wikimedia.org/r/698958 (owner: 10Jbond) [10:13:24] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 05m 41s) [10:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:23] (03PS1) 10Muehlenhoff: Add dummy keytabs for apt1001/apt2001 [labs/private] - 10https://gerrit.wikimedia.org/r/698959 [10:14:25] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:18] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01003 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:15:18] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 53s) [10:15:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:57] jbond: the puppet agent issue seems to be related to the base_url [10:18:22] Found value has wrong type, entry 'production' unrecognized key 'base_url' Found value has wrong type, entry 'staging' unrecognized key 'base_url' [10:18:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1130.eqiad.wmnet with reason: REIMAGE [10:18:40] elukey: I just pushed out a change, let me kick all the failed ones [10:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:11] ahhhh [10:19:14] lovely thanks [10:19:23] np :) [10:20:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:21:18] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytabs for apt1001/apt2001 [labs/private] - 10https://gerrit.wikimedia.org/r/698959 (owner: 10Muehlenhoff) [10:22:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1130.eqiad.wmnet with reason: REIMAGE [10:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas: add cas_configuration symlink [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698792 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [10:22:36] (03PS4) 10Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [10:26:51] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0059 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:34:16] (03PS5) 10Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] -
10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [10:35:33] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29843/console" [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:37:23] (03PS6) 10Jbond: cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) [10:38:07] (03CR) 10Jbond: "updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [10:38:52] (03CR) 10Volans: cumin: Add check_puppet_run_script so we can filter based on icinga status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [10:41:18] (03PS7) 10Jbond: cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) [10:41:32] (03CR) 10Jbond: "thx" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [10:45:53] (03PS1) 10Muehlenhoff: Enable apt* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698960 [10:45:56] (03CR) 10Volans: [C: 03+1] "I didn't test it this time, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [10:46:03] (03PS2) 10Muehlenhoff: Enable apt* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698960 [10:46:20] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/698960 (owner: 10Muehlenhoff) [10:46:33] (03PS3) 10Muehlenhoff: Enable apt* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698960 [10:47:01] 
(03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698960 (owner: 10Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1100). [11:00:05] DannyS712: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:06] I'm here [11:01:40] Happy to deploy this one. [11:02:03] (03CR) 10Daniel Kinzler: [C: 03+1] "The code looks correct, and it's something we want." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697786 (https://phabricator.wikimedia.org/T284141) (owner: 10Vlad.shapik) [11:03:26] (03CR) 10Awight: [C: 03+2] "Config deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698855 (https://phabricator.wikimedia.org/T284627) (owner: 10DannyS712) [11:03:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:04:17] (03Merged) 10jenkins-bot: Set wgAutoConfirmCount to 10 for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698855 (https://phabricator.wikimedia.org/T284627) (owner: 10DannyS712) [11:04:30] awight let me know when to test and on which debug host [11:04:52] DannyS712: will do! 
[11:06:00] (03CR) 10Muehlenhoff: profile::contacts: add a profile and define for adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [11:06:35] DannyS712: Ready on mwdebug1002 [11:07:21] confirmed to work - tested at https://en.wikisource.org/wiki/Special:UserRights/DannyS712_test (more than 4 days ago and 0 edits) - shows as autoconfirmed on normal hosts, doesn't say that for mwdebug1002 [11:07:43] ty [11:10:12] (03PS1) 10Volans: Update to v2.10.4-wmf2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698962 (https://phabricator.wikimedia.org/T244849) [11:10:15] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:698855|Set wgAutoConfirmCount to 10 for enwikisource (T284627)]] (duration: 02m 04s) [11:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:20] DannyS712: Deployed :-) [11:10:21] T284627: Amend English Wikisource wgAutoConfirmCount to 10 - https://phabricator.wikimedia.org/T284627 [11:10:26] thanks for the help! 
[11:11:28] !log EU deployment window complete [11:11:31] no, thank you [11:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:26] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:14:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698962 (https://phabricator.wikimedia.org/T244849) (owner: 10Volans) [11:17:38] (03CR) 10Volans: [V: 03+2 C: 03+2] Update to v2.10.4-wmf2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/698962 (https://phabricator.wikimedia.org/T244849) (owner: 10Volans) [11:18:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:56] (03CR) 10MSantos: [C: 03+1] maps: make maps2009 a buster imposm-based master in codfw [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:20:36] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:44] (03PS1) 10Muehlenhoff: Fix up access of datacenter-ops to apt* servers [puppet] - 10https://gerrit.wikimedia.org/r/698963 [11:21:51] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 01m 15s) [11:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:05] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [11:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/698963 (owner: 10Muehlenhoff) [11:22:48] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 00m 43s) [11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:00] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:27:55] !log drop keep_env from sudo config - #T275852 [11:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:00] T275852: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 [11:28:00] (03CR) 10Jbond: [C: 03+2] sudo: drop keep_env option [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: 10Jbond) [11:31:16] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: re-try [11:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:15] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: re-try (duration: 00m 59s) [11:32:20] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: re-try [11:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:43] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: re-try (duration: 02m 23s) [11:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:06] (03PS1) 10Muehlenhoff: Extend installserver Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/698967 [11:35:33] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: redeploy HEAD~1 [11:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:27] !log jbond@deploy1002 Finished deploy 
[netbox/deploy@f94ce0f]: redeploy HEAD~1 (duration: 00m 54s) [11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:25] !log jbond@deploy1002 Started deploy [netbox/deploy@c70df91]: redeploy HEAD~1 [11:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:14] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:40:21] !log jbond@deploy1002 Finished deploy [netbox/deploy@c70df91]: redeploy HEAD~1 (duration: 01m 55s) [11:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:10] (03PS1) 10Jbond: netbox: ignore cas_configueration.py [software/netbox] - 10https://gerrit.wikimedia.org/r/698968 [11:45:42] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [11:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:35] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 53s) [11:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:48] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:46:55] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [11:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:00] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 05s) [11:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:36] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification 
provided) [11:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:45] looks like parsoid has a sudden spike of errors ^ [11:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P16341 and previous config saved to /var/cache/conftool/dbconfig/20210609-114944-marostegui.json [11:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:52] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 02m 16s) [11:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:03] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [11:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:18] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:51:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Repool db1141 after schema change', diff saved to https://phabricator.wikimedia.org/P16342 and previous config saved to /var/cache/conftool/dbconfig/20210609-115104-root.json [11:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:31] (03PS1) 10Urbanecm: WelcomeSurveyExperimentalGroups: Use new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698969 (https://phabricator.wikimedia.org/T284599) [11:53:14] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 03m 11s) [11:53:15] I'm going to deploy a quick fix for Growth experiments [11:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:21] (03CR) 10Urbanecm: [C: 
03+2] WelcomeSurveyExperimentalGroups: Use new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698969 (https://phabricator.wikimedia.org/T284599) (owner: 10Urbanecm) [11:53:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:54:05] (03Merged) 10jenkins-bot: WelcomeSurveyExperimentalGroups: Use new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698969 (https://phabricator.wikimedia.org/T284599) (owner: 10Urbanecm) [11:54:07] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [11:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:48] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 41s) [11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:34] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:57:25] (03PS1) 10Urbanecm: WelcomeSurveyExperimentalGroups: Explicitly set exp1_group1 percentage to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698970 (https://phabricator.wikimedia.org/T284599) [11:57:27] (03CR) 10Urbanecm: [C: 03+2] WelcomeSurveyExperimentalGroups: Explicitly set exp1_group1 percentage to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698970 (https://phabricator.wikimedia.org/T284599) (owner: 10Urbanecm) [11:58:19] (03Merged) 10jenkins-bot: WelcomeSurveyExperimentalGroups: Explicitly set exp1_group1 percentage to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698970 (https://phabricator.wikimedia.org/T284599) (owner: 10Urbanecm) [11:58:24] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [11:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 54s) [11:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:30] (03PS1) 10Ssingh: Add doh4001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) [12:00:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ac43baa: d185728: WelcomeSurveyExperimentalGroups: Use new syntax (T284599) (duration: 01m 19s) [12:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:00] T284599: [BUG] Welcome survey is not being shown to a new account in our experiment - https://phabricator.wikimedia.org/T284599 [12:01:32] (03PS1) 10Ssingh: acme_chief: authorize doh4001 host for Wikidough 
[puppet] - 10https://gerrit.wikimedia.org/r/698972 (https://phabricator.wikimedia.org/T284349) [12:03:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29844/console" [puppet] - 10https://gerrit.wikimedia.org/r/698972 (https://phabricator.wikimedia.org/T284349) (owner: 10Ssingh) [12:03:06] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [12:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:11] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 06s) [12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [12:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps2009.codfw.wmnet with reason: Postgis version juggling [12:05:23] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps2009.codfw.wmnet with reason: Postgis version juggling [12:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Repool db1141 after schema change', diff saved to https://phabricator.wikimedia.org/P16343 and previous config saved to /var/cache/conftool/dbconfig/20210609-120608-root.json [12:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:20] !log stopped tilerator on maps2009 [12:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:51] (03PS1) 10Ssingh: site: switch doh4001 to O:wikidough [puppet] - 
10https://gerrit.wikimedia.org/r/698973 (https://phabricator.wikimedia.org/T284349) [12:09:21] !log running `nodetool decommission` on maps2009 [12:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:27] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: (no justification provided) [12:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:12] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: (no justification provided) (duration: 00m 44s) [12:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:36] (03PS1) 10Muehlenhoff: wdqs: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) [12:11:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:12:18] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:29] (03PS2) 10Ssingh: Add doh4001 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) [12:13:11] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 00m 53s) [12:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:41] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P16344 and previous config saved to /var/cache/conftool/dbconfig/20210609-121501-marostegui.json [12:15:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Repool db1143 after schema change', diff saved to https://phabricator.wikimedia.org/P16345 and previous config saved to /var/cache/conftool/dbconfig/20210609-121603-root.json [12:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:19] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 03m 38s) [12:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Repool db1141 after schema change', diff saved to https://phabricator.wikimedia.org/P16347 and previous config saved to /var/cache/conftool/dbconfig/20210609-122111-root.json [12:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] (03PS2) 10Muehlenhoff: wdqs: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) [12:26:24] Amir1: not sure if you are around but on lists1001 there's a stale mailman_queues.prom file reported, known ? [12:26:47] godog: I'm always around [12:26:49] let me see [12:27:26] lolz [12:27:31] thank you Amir1 [12:27:57] godog: can I have more details? Can't say for sure. It might be because we shut down mm2 and those files don't get changed anymore. Needing cleanup [12:28:40] godog: once a wise man said https://bash.toolforge.org/quip/AU9RX_pt1oXzWjit5Sf- [12:28:51] lol [12:29:29] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10fgiunchedi) 05Resolved→03Open reopening as I noticed the battery status is still recharging, though I think that's unusual and surely should have been charged by now? ` Cache Board Present... 
[12:29:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:30:16] hahaha [12:30:50] Amir1: ok I'll take a look shortly [12:31:01] let me know [12:31:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Repool db1143 after schema change', diff saved to https://phabricator.wikimedia.org/P16348 and previous config saved to /var/cache/conftool/dbconfig/20210609-123106-root.json [12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:07] Amir1: yeah that's it, it is due to mm2 shutdown, I'll just remove the file [12:33:20] cool. Thanks. [12:33:26] !log lists1001:rm /var/lib/prometheus/node.d/mailman_queues.prom [12:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet [12:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Repool db1141 after schema change', diff saved to https://phabricator.wikimedia.org/P16349 and previous config saved to /var/cache/conftool/dbconfig/20210609-123615-root.json [12:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:55] (03PS3) 10Muehlenhoff: wdqs: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) [12:39:14] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [12:39:55] !log jbond@deploy1002 Finished deploy 
[netbox/deploy@98cf8df]: (no justification provided) (duration: 00m 41s) [12:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:51] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:38] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 00m 47s) [12:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:46] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:54] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 01m 08s) [12:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:57] !log jbond@deploy1002 Started deploy [netbox/deploy@98cf8df]: (no justification provided) [12:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:25] !log jbond@deploy1002 Finished deploy [netbox/deploy@98cf8df]: (no justification provided) (duration: 00m 28s) [12:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:43] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: authorize doh4001 host for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698972 (https://phabricator.wikimedia.org/T284349) (owner: 10Ssingh) [12:44:02] (03PS4) 10Muehlenhoff: wdqs: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) [12:44:07] jbond: deploying something? :) [12:44:29] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10fgiunchedi) No change, what do you think @papaul ? 
[12:44:43] XioNoX: fighting scap :( [12:45:04] its winning [12:46:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Repool db1143 after schema change', diff saved to https://phabricator.wikimedia.org/P16350 and previous config saved to /var/cache/conftool/dbconfig/20210609-124610-root.json [12:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:23] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: roll back to HEAD~1 [12:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:29] SRE, User-MoritzMuehlenhoff: debmonitor-client: urllib2 deprecation warning on Bullseye - https://phabricator.wikimedia.org/T284647 (MoritzMuehlenhoff) [12:46:37] SRE, User-MoritzMuehlenhoff: debmonitor-client: urllib2 deprecation warning on Bullseye - https://phabricator.wikimedia.org/T284647 (MoritzMuehlenhoff) p:Triage→Low [12:46:41] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [12:47:17] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: roll back to HEAD~1 (duration: 00m 53s) [12:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:06] SRE, ops-codfw, User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (Papaul) We should try another BBU and see @fgiunchedi [12:50:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet [12:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:21] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff) [12:55:28] SRE, ops-codfw, User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (fgiunchedi) >>!
In T283401#7145629, @Papaul wrote: > We should try another BBU and see @fgiunchedi Sounds good to me, feel free to power down the host at your convenience [12:57:28] (CR) Ssingh: [V: +1 C: +2] acme_chief: authorize doh4001 host for Wikidough [puppet] - https://gerrit.wikimedia.org/r/698972 (https://phabricator.wikimedia.org/T284349) (owner: Ssingh) [12:58:15] (CR) David Caro: [C: +2] ceph: add cookbooks to reboot osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [12:58:20] (CR) David Caro: [C: -1] ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [13:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Repool db1143 after schema change', diff saved to https://phabricator.wikimedia.org/P16351 and previous config saved to /var/cache/conftool/dbconfig/20210609-130114-root.json [13:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:13] (PS1) Muehlenhoff: dumps::distribution::server: Switch to -full flavour [puppet] - https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) [13:03:30] (PS2) Muehlenhoff: dumps::distribution::server: Switch to -full flavour [puppet] - https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) [13:04:18] (CR) David Caro: [C: -1] ceph: add cookbooks to reboot osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [13:04:34] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698976 (https://phabricator.wikimedia.org/T164454) (owner: Muehlenhoff) [13:05:54] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: test master with 698968 [13:05:57]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:14] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (Papaul) I email Dell with the last update. [13:07:09] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: test master with 698968 (duration: 01m 14s) [13:07:11] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: test master with 698968 [13:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:21] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: test master with 698968 (duration: 00m 10s) [13:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:36] !log jbond@deploy1002 Started deploy [netbox/deploy@f94ce0f]: test master with 698968 [13:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:56] (PS1) Kormat: mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648) [13:10:02] !log jbond@deploy1002 Finished deploy [netbox/deploy@f94ce0f]: test master with 698968 (duration: 02m 26s) [13:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] (CR) Kormat: [C: -2] "Don't merge before maintenance window." [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648) (owner: Kormat) [13:11:42] (PS1) Kormat: wmnet: Update s3-master to db1157 [dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648) [13:12:03] (CR) Kormat: [C: -2] "Don't merge before maintenance window."
[dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648) (owner: Kormat) [13:12:34] !log installing nginx security updates [13:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] SRE, netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (ayounsi) LibreNMS doesn't expose ifOutDiscards in its alert criteria so I had to write a custom SQL alert. `lang=sql SELECT distinct hostname FROM devices,ports,ports_statistics WHERE (ports.... [13:16:13] SRE, netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (ayounsi) a:ayounsi [13:18:28] (PS1) Muehlenhoff: configcluster: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/698984 (https://phabricator.wikimedia.org/T164456) [13:18:54] RECOVERY - SSH on ores2005.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:57] (CR) David Caro: [C: -1] ceph: add cookbooks to reboot osds (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [13:20:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698984 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff) [13:21:48] (CR) Marostegui: [C: +1] mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648) (owner: Kormat) [13:21:54] * elukey sees kormat giving -2 to kormat, I like it [13:22:07] (CR) Jbond: [V: +2 C: +2] netbox: ignore cas_configueration.py [software/netbox] - https://gerrit.wikimedia.org/r/698968 (owner: Jbond) [13:22:32] elukey: some days i just gotta protect the world...
from me [13:22:59] kormat: very appreciated :D [13:26:04] (CR) Jbond: [C: +2] cumin: Add check_puppet_run_script so we can filter based on icinga status [puppet] - https://gerrit.wikimedia.org/r/649933 (https://phabricator.wikimedia.org/T268211) (owner: Jbond) [13:27:23] (PS5) Jbond: O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - https://gerrit.wikimedia.org/r/675805 [13:27:53] (PS2) Muehlenhoff: configcluster: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/698984 (https://phabricator.wikimedia.org/T164456) [13:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P16354 and previous config saved to /var/cache/conftool/dbconfig/20210609-132958-marostegui.json [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:30] SRE, Security, User-jbond: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (jbond) Open→Resolved a:jbond config has now been updated to remove keep_env [13:30:45] (CR) Jbond: [C: +2] O:cluster::managment: move monitoring from puppetdb to cumin host [puppet] - https://gerrit.wikimedia.org/r/675805 (owner: Jbond) [13:31:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698984 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff) [13:32:07] (PS1) Volans: Update to v2.10.4-wmf3 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/698986 (https://phabricator.wikimedia.org/T244849) [13:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166 after schema change', diff saved to https://phabricator.wikimedia.org/P16355 and previous config saved to /var/cache/conftool/dbconfig/20210609-133257-root.json [13:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:55]
(CR) Jbond: [C: +1] "lgtm" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/698986 (https://phabricator.wikimedia.org/T244849) (owner: Volans) [13:33:58] (PS1) David Caro: icinga.icinga_hosts: use bash wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) [13:34:12] (PS3) David Caro: ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) [13:34:14] (PS1) David Caro: wmcs: Fixed docstring on CephController.get_nodes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698988 [13:34:16] (PS1) David Caro: wmcs.ceph: Add cookbook to reboot mons [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/698989 (https://phabricator.wikimedia.org/T281248) [13:40:27] (CR) jerkins-bot: [V: -1] icinga.icinga_hosts: use bash wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [13:40:34] (CR) Muehlenhoff: [C: +2] Extend installserver Cumin alias [puppet] - https://gerrit.wikimedia.org/r/698967 (owner: Muehlenhoff) [13:41:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698960 (owner: Muehlenhoff) [13:42:33] (CR) David Caro: "There's some (most probably unrelated) error with sphinx" [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [13:43:08] dcaro: yes I'm aware, sphinx 4 did break compatibility I just didn't had time to send patches to all projects with the upper limit on the sphinx version [13:43:14] currently in meetings, I can try later today [13:43:31] and it was released few days ago [13:45:08] (PS2) Ssingh: site: switch doh4001 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/698973 (https://phabricator.wikimedia.org/T284349)
[13:46:05] (CR) Ssingh: [C: +2] site: switch doh4001 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/698973 (https://phabricator.wikimedia.org/T284349) (owner: Ssingh) [13:46:27] volans: ack, no rush [13:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166 after schema change', diff saved to https://phabricator.wikimedia.org/P16356 and previous config saved to /var/cache/conftool/dbconfig/20210609-134800-root.json [13:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:32] (PS1) Ottomata: Migrate LandingPageImpression to Event Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/698993 (https://phabricator.wikimedia.org/T282855) [13:52:10] (CR) Ottomata: [C: +1] Migrate WMDEBanner* schemas to EventPlatform on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/698811 (https://phabricator.wikimedia.org/T282562) (owner: Mforns) [13:53:59] (PS4) Ssingh: site: add wikidough eqiad with insetup role [puppet] - https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) [13:54:05] (PS2) Ottomata: Migrate WMDEBanner* schemas to EventPlatform on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/698811 (https://phabricator.wikimedia.org/T282562) (owner: Mforns) [13:54:50] !log Add Routinator 3000 0.9.0 to the APT repo - T282469 [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:54] T282469: routinator: create garbage collection job - https://phabricator.wikimedia.org/T282469 [13:56:26] !log upgrade Routinator 3000 to 0.9.0 on rpki1001 - T282469 [13:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:11] (CR) Ottomata: [C: +2] Migrate WMDEBanner* schemas to EventPlatform on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/698811 (https://phabricator.wikimedia.org/T282562) (owner: Mforns) [13:58:24]
(PS1) Jbond: admin::jbond: add s_client function [puppet] - https://gerrit.wikimedia.org/r/698994 [13:58:49] (PS2) Ottomata: Migrate LandingPageImpression to Event Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/698993 (https://phabricator.wikimedia.org/T282855) [13:59:33] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate WMDEBanner* schemas to EventPlatform on testwiki - T282562 (duration: 01m 08s) [13:59:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:40] T282562: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 [14:01:03] SRE, netops: routinator: create garbage collection job - https://phabricator.wikimedia.org/T282469 (ayounsi) Open→Resolved All done! I didn't use `--fresh` but we can if needed in the future, but in theory inodes shouldn't grow in a dangerous way anymore. [14:01:08] (CR) Ottomata: [C: +2] Migrate LandingPageImpression to Event Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/698993 (https://phabricator.wikimedia.org/T282855) (owner: Ottomata) [14:03:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166 after schema change', diff saved to https://phabricator.wikimedia.org/P16357 and previous config saved to /var/cache/conftool/dbconfig/20210609-140304-root.json [14:05:10] (PS6) Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] - https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [14:05:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:08:02] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:07] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2009.codfw.wmnet [14:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:13] !log hnowlan@puppetmaster1001 conftool action : set/weight=0; selector: name=maps2009.codfw.wmnet [14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:22] (PS1) Mforns: Migrate WMDEBanner* schemas to EventPlatform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698996 (https://phabricator.wikimedia.org/T282562) [14:13:50] (PS1) Jbond: cfssl::cert: Improve check for generating chained file [puppet] - https://gerrit.wikimedia.org/r/698997 [14:14:19] (CR) Jbond: [C: +2] admin::jbond: add s_client function [puppet] - https://gerrit.wikimedia.org/r/698994 (owner: Jbond) [14:16:59] (CR) Jbond: [C: +2] cfssl::cert: Improve check for generating chained file [puppet] - https://gerrit.wikimedia.org/r/698997 (owner: Jbond) [14:18:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166 after schema change', diff saved to https://phabricator.wikimedia.org/P16358 and previous config saved to /var/cache/conftool/dbconfig/20210609-141807-root.json [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:07] (PS1) Ottomata: Migrate LandingPageImpression to Event Platform on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/698998 (https://phabricator.wikimedia.org/T282855) [14:20:46] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/696282
(https://phabricator.wikimedia.org/T283660) (owner: MMandere) [14:21:10] (CR) Ottomata: [C: +2] Migrate LandingPageImpression to Event Platform on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/698998 (https://phabricator.wikimedia.org/T282855) (owner: Ottomata) [14:22:43] (PS1) Jbond: cfss::cert: drop refreshonly [puppet] - https://gerrit.wikimedia.org/r/698999 [14:23:02] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate LandingPageImpression schema to EventPlatform on testwiki - T282855 (duration: 01m 07s) [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:08] T282855: LandingPageImpression Event Platform Migration - https://phabricator.wikimedia.org/T282855 [14:23:19] (CR) jerkins-bot: [V: -1] cfss::cert: drop refreshonly [puppet] - https://gerrit.wikimedia.org/r/698999 (owner: Jbond) [14:23:57] (PS2) Jbond: cfss::cert: drop refreshonly [puppet] - https://gerrit.wikimedia.org/r/698999 [14:25:33] (PS1) Ottomata: Migrate LandingPageImpression to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/699000 (https://phabricator.wikimedia.org/T282855) [14:26:45] (CR) Jbond: [C: +2] cfss::cert: drop refreshonly [puppet] - https://gerrit.wikimedia.org/r/698999 (owner: Jbond) [14:26:47] (PS7) Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] - https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [14:27:45] (CR) Ottomata: [C: +2] Migrate LandingPageImpression to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/699000 (https://phabricator.wikimedia.org/T282855) (owner: Ottomata) [14:28:22] SRE, SRE-tools, User-MoritzMuehlenhoff: debmonitor-client: urllib2 deprecation warning on Bullseye - https://phabricator.wikimedia.org/T284647 (Volans) a:Volans [14:28:57] (CR) Volans: [V: +2 C: +2] Update to
v2.10.4-wmf3 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/698986 (https://phabricator.wikimedia.org/T244849) (owner: Volans) [14:30:39] (CR) Volans: [C: +1] "Ack, LGTM" [puppet] - https://gerrit.wikimedia.org/r/698963 (owner: Muehlenhoff) [14:32:25] (PS2) Ottomata: Migrate WMDEBanner* schemas to EventPlatform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698996 (https://phabricator.wikimedia.org/T282562) (owner: Mforns) [14:33:06] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate LandingPageImpression schema to EventPlatform on all wikis - T282855 (duration: 01m 06s) [14:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:10] T282855: LandingPageImpression Event Platform Migration - https://phabricator.wikimedia.org/T282855 [14:35:23] (CR) Ottomata: [C: +2] Migrate WMDEBanner* schemas to EventPlatform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698996 (https://phabricator.wikimedia.org/T282562) (owner: Mforns) [14:36:57] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate WMDEBanner* schemas to EventPlatform on all wikis - T282562 (duration: 01m 06s) [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:06] T282562: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 [14:42:16] (PS1) Ottomata: Finalize backend EP migration of 4 EL schemas [puppet] - https://gerrit.wikimedia.org/r/699002 (https://phabricator.wikimedia.org/T282855) [14:43:56] (PS1) Andrew Bogott: Try to disable paging for cloudceph nodes in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/699004 [14:45:33] !log installing postgresql 9.6 security updates on stretch [14:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] (CR) Bstorm: [C: +1] "I think this will do what we want.
The default is "admins" or the ops feed. We may want to add the `wmcs-bots` contact group to codfw site" [puppet] - https://gerrit.wikimedia.org/r/699004 (owner: Andrew Bogott) [14:49:20] (CR) Bstorm: [C: +1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/699004 (owner: Andrew Bogott) [14:50:50] !log volans@deploy1002 Started deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] !log volans@deploy1002 Finished deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o (duration: 00m 15s) [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:07] (PS1) Arturo Borrero Gonzalez: hieradata: disable icinga alerts for ceph hosts @ codfw [puppet] - https://gerrit.wikimedia.org/r/699026 [14:52:35] (PS1) Jbond: P:cumin::master: change permission of config file [puppet] - https://gerrit.wikimedia.org/r/699027 (https://phabricator.wikimedia.org/T268211) [14:53:06] (CR) Bstorm: "Wouldn't it be more useful to get notifications that only go to IRC with wmcs-bots?"
[puppet] - https://gerrit.wikimedia.org/r/699026 (owner: Arturo Borrero Gonzalez) [14:53:27] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29846/console" [puppet] - https://gerrit.wikimedia.org/r/699027 (https://phabricator.wikimedia.org/T268211) (owner: Jbond) [14:53:38] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:53:53] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/699026 (owner: Arturo Borrero Gonzalez) [14:54:16] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (Papaul) @wiki_willy The Raritan PDU uses another type of sensor then the Chartsworth, we will have to ask Rahi to send us Raritan smart snesor. Thanks [14:56:04] (CR) Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1002/29847/" [puppet] - https://gerrit.wikimedia.org/r/699026 (owner: Arturo Borrero Gonzalez) [14:57:35] !log volans@deploy1002 Started deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o [14:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:40] !log volans@deploy1002 Finished deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o (duration: 00m 04s) [14:57:41] (CR) Zfilipin: selenium: Upgrade WebdriverIO to v7 (1 comment) [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: Sahilgrewalhere) [14:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:02] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:00:54] !log volans@deploy1002 Started deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o [15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] !log volans@deploy1002 Finished deploy [netbox/deploy@91fd299]: Release v2.10.4-wmf3 to netbox-next.w.o (duration: 00m 55s) [15:01:51] (CR) Arturo Borrero Gonzalez: [C: +2] Try to disable paging for cloudceph nodes in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/699004 (owner: Andrew Bogott) [15:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:01] (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: disable icinga alerts for ceph hosts @ codfw [puppet] - https://gerrit.wikimedia.org/r/699026 (owner: Arturo Borrero Gonzalez) [15:02:10] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2009.codfw.wmnet with reason: Rebuilding as buster master [15:02:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2009.codfw.wmnet with reason: Rebuilding as buster master [15:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:58] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 137093999856 and 138816 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:08:42] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:08:43] !log restarting acme-chief on acmechief1001 [15:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:18] ACKNOWLEDGEMENT - Check systemd state
on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-metrics-collector.service,wmf_auto_restart_cassandra-metrics-collector.service Hnowlan Known issue - patch in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:06] (PS8) Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] - https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [15:14:14] (CR) Hnowlan: [C: +2] maps: make maps2009 a buster imposm-based master in codfw [puppet] - https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) (owner: Hnowlan) [15:14:35] (PS1) Volans: doc: use add_css_file() instead of add_stylesheet() [software/spicerack] - https://gerrit.wikimedia.org/r/699031 [15:14:37] (PS1) Volans: doc: fix parameter type in docstring [software/spicerack] - https://gerrit.wikimedia.org/r/699032 [15:14:46] (CR) Muehlenhoff: [C: +2] Fix up access of datacenter-ops to apt* servers [puppet] - https://gerrit.wikimedia.org/r/698963 (owner: Muehlenhoff) [15:18:41] (CR) Ahmon Dancy: [C: +2] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar) [15:18:46] (CR) David Caro: [C: -1] "Deleted a bit too much xd" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/699032 (owner: Volans) [15:18:56] (CR) jerkins-bot: [V: -1] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar) [15:19:36] (CR) Volans: "replied to comments" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/699032 (owner: Volans) [15:19:38] (PS12) Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 [15:21:24] SRE, ops-eqiad, DC-Ops: Netbox Errors -
https://phabricator.wikimedia.org/T283518 (Cmjohnson) Open→Resolved fixed [15:22:14] (CR) Ahmon Dancy: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar) [15:22:27] (CR) Ahmon Dancy: [C: +2] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar) [15:25:15] (CR) David Caro: [C: +1] doc: fix parameter type in docstring (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/699032 (owner: Volans) [15:28:41] (Merged) jenkins-bot: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar) [15:29:12] (CR) Jbond: [C: +1] doc: use add_css_file() instead of add_stylesheet() [software/spicerack] - https://gerrit.wikimedia.org/r/699031 (owner: Volans) [15:29:27] (CR) Volans: [C: +2] doc: use add_css_file() instead of add_stylesheet() [software/spicerack] - https://gerrit.wikimedia.org/r/699031 (owner: Volans) [15:30:26] (CR) Volans: [C: +2] doc: fix parameter type in docstring [software/spicerack] - https://gerrit.wikimedia.org/r/699032 (owner: Volans) [15:36:08] (Merged) jenkins-bot: doc: use add_css_file() instead of add_stylesheet() [software/spicerack] - https://gerrit.wikimedia.org/r/699031 (owner: Volans) [15:36:12] (Merged) jenkins-bot: doc: fix parameter type in docstring [software/spicerack] - https://gerrit.wikimedia.org/r/699032 (owner: Volans) [15:37:05] !log rebuilding maps2009 as buster master [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:18] (PS2) Volans: icinga.icinga_hosts: use bash wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [15:38:58] (CR) Volans: [C: +2] "Thanks for the
patch. LGTM." [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [15:46:16] (Merged) jenkins-bot: icinga.icinga_hosts: use bash wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/698987 (https://phabricator.wikimedia.org/T281248) (owner: David Caro) [15:47:13] (CR) Volans: "comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699027 (https://phabricator.wikimedia.org/T268211) (owner: Jbond) [15:52:53] (PS1) Ahmon Dancy: Update plugins for Gerrit 3.2.10 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/699035 [15:55:25] (CR) Volans: [C: +2] "Verified with James" [puppet] - https://gerrit.wikimedia.org/r/698877 (https://phabricator.wikimedia.org/T284613) (owner: Jforrester) [15:55:32] (PS3) Volans: admin: Add second SSH key for jforrester [puppet] - https://gerrit.wikimedia.org/r/698877 (https://phabricator.wikimedia.org/T284613) (owner: Jforrester) [16:00:51] (PS1) Jdlrobson: Drop description on beta labs test survey [mediawiki-config] - https://gerrit.wikimedia.org/r/699037 (https://phabricator.wikimedia.org/T257695) [16:03:34] SRE, SRE-Access-Requests, Patch-For-Review: Add ssh key for jforrester - https://phabricator.wikimedia.org/T284613 (Volans) Patch merged and deployed to all bastions, will be everywhere within ~30 minutes. Let us know if all works fine and feel free to resolve this task if there are no issues.
[16:05:37] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [16:10:47] (03PS1) 10Ahmon Dancy: Upgrade Gerrit to v3.2.10 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699038 [16:10:57] (03PS1) 10Cwhite: logstash: transition openstack to ECS [puppet] - 10https://gerrit.wikimedia.org/r/699039 (https://phabricator.wikimedia.org/T234565) [16:14:33] (03PS2) 10Ahmon Dancy: Update plugins for Gerrit 3.2.10 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 [16:20:00] 10SRE, 10LDAP-Access-Requests: Access request to superset for user lzaman - https://phabricator.wikimedia.org/T284249 (10LZaman) 05Open→03Resolved [16:26:00] PROBLEM - kartotherian endpoints health on maps2009 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:30:49] (03PS4) 10Elukey: [WIP] - Add the custom_deploy.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [16:32:25] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2009 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) Hnowlan New buster host - import not complete https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:35:41] !log import docker-report 0.0.12 into buster-wikimedia [16:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:13] (03PS1) 10Hnowlan: osm: create missing imposm directories [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) [16:37:29] (03PS1) 10Volans: netbox: fix CAS configuration 
[puppet] - 10https://gerrit.wikimedia.org/r/699045 [16:39:10] (03CR) 10Jbond: [C: 03+2] netbox: fix CAS configuration [puppet] - 10https://gerrit.wikimedia.org/r/699045 (owner: 10Volans) [16:41:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10Cmjohnson) @ayounsi I got sidetracked with decom servers. I will ping you on IRC when I want to do this. [16:41:36] (03CR) 10Razzi: [C: 03+1] "@Elukey think you could merge this before next Tuesday, or would you like me to?" [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) (owner: 10Elukey) [16:41:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) No update yet, looking at returning the server still [16:42:47] (03CR) 10Elukey: "Yep I can do it but if you have time please go ahead :)" [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) (owner: 10Elukey) [16:42:50] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) The MW are almost ready, once we can get a batch of these online, the mw servers in A7 will be able to be decommissioned [16:43:17] 10SRE, 10Technical-blog-posts, 10Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (10srodlund) @Legoktm I moved this over to the blog and prepped it for publication. There were just a few minor issues with style and... 
[16:43:28] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:23] 10SRE, 10Goal: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10Eevans) [16:44:36] 10SRE, 10Patch-For-Review, 10Platform Engineering (Icebox): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10Eevans) [16:46:56] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-metrics-collector.service,imposm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:00] 10SRE, 10Platform Engineering (Icebox): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10Eevans) [16:47:54] (03PS21) 10DCausse: rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [16:47:56] (03PS6) 10DCausse: Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/693411 [16:47:58] (03PS5) 10DCausse: rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 [16:48:51] (03CR) 10DCausse: rdf-streaming-updater: switch to H/A session-cluster (0317 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [16:49:16] PROBLEM - cassandra CQL 10.192.16.107:9042 on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:49:57] (03CR) 10jerkins-bot: [V: 04-1] Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/693411 (owner: 10DCausse) [16:50:49] 10SRE, 10Cassandra, 10RESTBase-Cassandra, 
10Patch-For-Review, 10Services (next): Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (10Eevans) [16:51:17] (03PS2) 10Hnowlan: osm: create missing imposm directories, add mirror support to import [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) [16:51:19] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [16:51:21] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 (owner: 10DCausse) [16:51:36] PROBLEM - cassandra service on maps2009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:54:10] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-metrics-collector.service,imposm.service Hnowlan new buster maps master- not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:10] ACKNOWLEDGEMENT - cassandra CQL 10.192.16.107:9042 on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 9042: Connection refused Hnowlan new buster maps master- not in use yet https://phabricator.wikimedia.org/T93886 [16:54:10] ACKNOWLEDGEMENT - cassandra service on maps2009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan new buster maps master- not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:54:20] (03CR) 10RLazarus: [C: 03+2] logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [16:58:46] RECOVERY - cassandra service on maps2009 is OK: 
OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:00:00] RECOVERY - cassandra CQL 10.192.16.107:9042 on maps2009 is OK: TCP OK - 0.032 second response time on 10.192.16.107 port 9042 https://phabricator.wikimedia.org/T93886 [17:07:02] 10SRE, 10SRE-Access-Requests: Add ssh key for jforrester - https://phabricator.wikimedia.org/T284613 (10Volans) 05Open→03Resolved a:03Volans Verified with @Jdforrester-WMF that the new key works as expected. resolving. [17:08:18] (03Abandoned) 10Jforrester: docker-reporter: Ignore dropped image releng/node10-kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/698883 (owner: 10Jforrester) [17:16:11] !log updated python3-docker-report to 0.0.12 on chartmuseum2001.codfw.wmnet,chartmuseum1001.eqiad.wmnet,deneb.codfw.wmnet,registry[2003-2008].codfw.wmnet,registry[1003-1004].eqiad.wmnet [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2009.codfw.wmnet with reason: Rebuilding as buster master [17:29:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2009.codfw.wmnet with reason: Rebuilding as buster master [17:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:17] !log aborrero@cumin1001 START - Cookbook sre.hosts.remove-downtime for cloudmetrics1002.eqiad.wmnet [17:32:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudmetrics1002.eqiad.wmnet [17:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:48] (03CR) 10JMeybohm: [C: 03+1] Add knative serving and net-istio images (031 comment) [docker-images/production-images] 
- 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [17:34:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10aborrero) Update: I just checked the server -- seems fine. We decided to remove the icinga downtime and see if we detect any... [17:37:14] (03CR) 10JMeybohm: [C: 04-1] Add tokens and users for tegola-vector-tiles (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/693924 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [17:37:20] (03CR) 10JMeybohm: [C: 03+1] Add tokens and users for tegola-vector-tiles [puppet] - 10https://gerrit.wikimedia.org/r/692669 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [17:37:23] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Thanks @Papaul. So in terms of feedback for Raritan, so far it's: - convert PDU to one row of plugs (instead of 2 rows) - request a Raritan smart s... [17:40:08] (03CR) 10JMeybohm: [C: 03+1] Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [17:41:54] 10SRE, 10SRE-tools, 10User-MoritzMuehlenhoff: debmonitor-client: urllib3 deprecation warning on Bullseye - https://phabricator.wikimedia.org/T284647 (10Volans) [17:46:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "+1 with some minor comments that you can ignore, or fix in a follow up patch." 
(032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [17:47:28] (03CR) 10JMeybohm: [C: 04-1] "kube-rbac-proxy's README says: "In Kubernetes clusters without NetworkPolicies any Pod can perform requests to every other Pod in the clus" (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [17:52:32] !log krinkle@mwmaint1002$ mwscript deleteEqualMessages.php --wiki rmywiki [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:42] Amir1: ^ after you :) [17:54:10] lol [17:55:08] https://www.irccloud.com/pastebin/d3nwbdAl/ [17:58:01] (03CR) 10JMeybohm: [C: 03+2] "Thanks! Let's just merge this now and piggyback it with the next update" [debs/helm3] - 10https://gerrit.wikimedia.org/r/696695 (owner: 10Dzahn) [17:58:05] list of wikis. I run it but I need to buy some stuff before they close [17:59:00] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:59:02] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284682 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [17:59:03] aid_Information_Gathering [17:59:07] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - 
https://phabricator.wikimedia.org/T284682 (10ops-monitoring-bot) [18:00:04] longma and twentyafterfour: Dear deployers, time to do the Train log triage with CPT deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:04:38] (03CR) 10JMeybohm: [C: 03+1] "You might want to delete the chart with the old name from chartmuseum which probably can be done easiest via swift client directly (as we " [deployment-charts] - 10https://gerrit.wikimedia.org/r/693917 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [18:07:14] !log krinkle@mwmaint1002$ mwscript deleteEqualMessages.php (foreachwiki) [18:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:56] (03CR) 10Dzahn: [C: 03+1] "thanks! 
this is right" [puppet] - 10https://gerrit.wikimedia.org/r/698963 (owner: 10Muehlenhoff) [18:20:38] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:21:46] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:22:25] 10SRE, 10Technical-blog-posts, 10Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (10Ladsgroup) It's majestic <3 [18:38:38] (03PS1) 10Volans: setup.py: fix Django classifier [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699056 [18:38:40] (03PS1) 10Volans: cli: urllib3 backward/forward compatibility [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699057 (https://phabricator.wikimedia.org/T284647) [18:40:36] (03CR) 10Razzi: [V: 03+1] "Came up with the plan with Ottomata to make the memcached packages clean up after themselves using ensure => absent, so nobody will have t" [puppet] - 10https://gerrit.wikimedia.org/r/693981 (https://phabricator.wikimedia.org/T273850) (owner: 10Razzi) [18:59:44] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.66e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [19:00:05] longma and twentyafterfour: #bothumor I ♥ Unicode. All rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1900). 
[19:03:29] (03PS1) 10Jeena Huneidi: group1 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699060 [19:03:32] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699060 (owner: 10Jeena Huneidi) [19:04:48] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699060 (owner: 10Jeena Huneidi) [19:06:54] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.9 refs T281150 [19:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:58] T281150: 1.37.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T281150 [19:08:01] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.9 refs T281150 (duration: 01m 07s) [19:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:32] (03CR) 10Razzi: "I think this has the wrong phabricator ticket, should be T164456 not T163356" [puppet] - 10https://gerrit.wikimedia.org/r/698767 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [19:14:40] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:14:42] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284690 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [19:14:42] aid_Information_Gathering [19:14:47] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284690 (10ops-monitoring-bot) [19:15:01] "oh no, it's a HP (tm)" [19:15:25] mutante: yes it is HP will replace the BBU tomorrow [19:15:55] mutante: i have already a task open for that [19:16:01] papaul: heh! wow, now that's what I call prompt service [19:16:09] mutante: lol [19:16:12] :) [19:19:11] https://people.wikimedia.org/~dzahn/Screenshot%20at%202021-06-09%2012-17-16.png [19:19:17] papaul: ^:) [19:20:05] let's merge them all into the actual ticket. what is the one you want to use [19:20:38] mutante: https://phabricator.wikimedia.org/T283401 [19:22:14] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10Dzahn) [19:22:17] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284690 (10Dzahn) [19:22:19] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284682 (10Dzahn) [19:22:20] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:22:24] thanks, done [19:23:24] (03PS1) 10Ahmon Dancy: releases1001/1002: Allow the overlay kernel module to load [puppet] - 10https://gerrit.wikimedia.org/r/699063 [19:24:02] (03CR) 10jerkins-bot: [V: 04-1] releases1001/1002: Allow the overlay kernel module to load [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:25:00] (03PS2) 10Ahmon Dancy: releases1001/1002: Allow the overlay kernel module to load [puppet] - 10https://gerrit.wikimedia.org/r/699063 [19:29:17] (03PS1) 10Dzahn: deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) [19:29:52] (03CR) 
10jerkins-bot: [V: 04-1] deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:31:01] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:31:22] (03CR) 10Dzahn: "there is no more releases1001. it's 1002 and 2002. but since you want to apply the same thing to all hosts, please use hieradata/role/comm" [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:32:00] (03CR) 10Ahmon Dancy: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:32:45] dancy: hieradata/role/common/ would be better if it's the same for all nodes anyways, no need to clean up when host names change in the future [19:33:01] 👍🏾 [19:34:30] (03PS2) 10Dzahn: deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) [19:34:57] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10KFrancis) @RStallman-legalteam Thanks for taking care of this, Rachel!!! 
[19:36:02] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:36:11] (03CR) 10jerkins-bot: [V: 04-1] deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:36:34] (03PS3) 10Ahmon Dancy: releases servers: Allow the overlay kernel module to load [puppet] - 10https://gerrit.wikimedia.org/r/699063 [19:37:37] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:39:18] (03CR) 10Ahmon Dancy: [C: 03+1] "Ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:40:01] (03PS3) 10Dzahn: deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) [19:40:58] (03CR) 10Dzahn: "fyi, this can influence Icinga alerting for releases hosts, because whenever we have hosts with docker on them, they get the extra lines i" [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:42:29] (03CR) 10Dzahn: [C: 03+2] "this is like existing situation on CI::master." [puppet] - 10https://gerrit.wikimedia.org/r/699063 (owner: 10Ahmon Dancy) [19:43:15] Thanks Dzahn [19:43:42] dancy: we gotta keep an eye on icinga alerts for releases.. maybe [19:43:46] see comment on gerrit [19:43:51] deploying that right now [19:44:29] Saw that. I see that there is a profile::base::check_disk_options clause in `hieradata/role/common/releases.yaml` so I presume that's what would need to be updated in case of alerts. 
[19:44:40] yes, that's the one [19:44:55] there would have to be some -i line to ignore certain patterns [19:44:58] like docker [19:46:27] Info: /Stage[main]/Base::Kernel/Kmod::Blacklist[wmf_overlay]/File[/etc/modprobe.d/blacklist-wmf_overlay.conf]: Scheduling refresh of Exec[update-initramfs] [19:46:32] (03PS1) 10DannyS712: Revert "Add type hint to constructor of LanguageConverter" [core] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699014 (https://phabricator.wikimedia.org/T284685) [19:47:35] dancy: can you test on releases1002? [19:48:15] Just did. Looks good. docker starts up using the overlay2 storage driver now. Gonna do a few operations to make sure everything else works. [19:48:28] [releases1002:~] $ lsmod | grep overlay [19:48:28] overlay 131072 0 [19:48:43] cool, ty, applied on 2002 as well [19:49:59] on 2002 it kind of sits there at Scheduling refresh of Exec[update-initramfs] ..hrmm [19:50:20] ok, done now. puppet has totally unrelated puppet failure [19:51:58] [releases1002:~] $ cat /etc/nagios/nrpe.d/check_disk_space.cfg [19:52:18] /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/run/docker" --exclude-type=fuse.fuse_dfs --exclude-type=tracefs [19:52:29] DISK OK [19:52:57] it's ok with the existing docker exclude [19:53:27] I ran it myself and I see that it included `/srv/docker/overlay2/dc906bc350e1ff1c175e8ec033c607b4c9a111a8b88ead0cacc89b17d037f905/merged` in the output. [19:53:43] but that effectively is just the /srv/docker filesystem, so redundant info. [19:54:13] I see, yea [19:54:40] well, if we dont have to ignore it and still get an OK then all is good [19:54:54] better than ignoring things even [19:55:04] Agreed. Thanks for your help Dzahn/mutante [19:55:07] yw [19:57:04] I cleaned out the devicemapper cruft from both hosts. [19:57:21] nice,thx [20:00:05] longma and twentyafterfour: May I have your attention please! MediaWiki train - American Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1900) [20:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T2000). [20:06:59] one of these days I need to find out how who came up with these jouncebot callouts :) [20:07:51] 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10KFrancis) @RStallman-legalteam Thanks Rachel! [20:09:33] (03CR) 10Dzahn: [C: 04-1] "'join' parameter 'arg' expects an Array value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:21:07] (03PS4) 10Dzahn: deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) [20:23:27] (03PS5) 10Dzahn: deployment::rsync:: temp allow dumping files from miscweb1002 [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) [20:26:12] (03CR) 10Dzahn: "@apergos here you can see how auto_ferm works: https://puppet-compiler.wmflabs.org/compiler1003/29850/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:26:54] thanks mutante I'll leave that tab open for reading tomorrow! 
[20:27:46] (03CR) 10Dzahn: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:28:02] yep, added another comment [20:29:25] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29850/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/699064 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:29:30] jouncebot: now [20:29:30] For the next 0 hour(s) and 30 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T1900) [20:29:31] For the next 0 hour(s) and 30 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T2000) [20:32:23] we don't always deploy firewall changes on deployment servers, but if we do we do it during the deployment window [20:36:14] !log deployed temp ferm change on deployment servers to let miscweb dump data, puppetized. scap pull from mwdebug1001 works, deployment good to go [20:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:37] sukhe: you can praise Niharika for most of them. 
for example -- https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/377945 [20:37:44] see also: https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/380437/2/DefaultConfig.yaml [20:38:57] (03PS22) 10DCausse: rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [20:38:59] (03PS7) 10DCausse: Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/693411 [20:39:01] (03PS6) 10DCausse: rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 [20:40:36] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 (owner: 10DCausse) [20:40:36] bd808: ha thanks! [20:41:31] I was thinking that we could use some new ones, but so far I've been too lazy to dream up a new batch [20:45:26] every once in a while it could pull a random one from wikiquote, like https://en.wikiquote.org/wiki/Wikipedia#Quotes quotes about Wikipedia [20:46:03] and on certain days of the week it pulls a random one from the quips app [20:52:35] jeena: if you're done, do you want me to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699014/? [20:52:48] it seems straightforward enough [20:55:03] Yeah, done with train for today. Sounds good! [20:59:59] 10SRE, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10MPhamWMF) Search is implementing a temporary reactive solution to https://phabricator.wikimedia.org/T284479, but will need the issue here regarding au... 
[21:00:12] !log deploy1002 - creating temp dir /srv/miscweb to rsync static-bugzilla data to, coming from miscweb1002 T281538 [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:16] T281538: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 [21:00:59] (03PS5) 10Ssingh: site: add wikidough eqiad with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) [21:01:20] jeena: oh, so you _were_ deploying.. but nothing was off, right? [21:01:57] i did deploy at around 2 hours ago...everything looked alright to me [21:02:49] jeena: oh, 2 hours ago. ok. Then I can't log "deployed firewall change on deployment servers during deployment and deployers didn't even notice" [21:02:57] :P [21:02:58] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [core] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699014 (https://phabricator.wikimedia.org/T284685) (owner: 10DannyS712) [21:08:28] !log rsyncing static-bugzilla HTML from miscweb1002 to deploy1002 [21:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:30] (03Merged) 10jenkins-bot: Revert "Add type hint to constructor of LanguageConverter" [core] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699014 (https://phabricator.wikimedia.org/T284685) (owner: 10DannyS712) [21:23:52] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:36] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:15] There's an undeployed patch in deploy1002 [21:34:01] jeena: this is not rebased/deployed https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/698681 [21:34:31] that is weird [21:35:23] 
shouldn't it have at least been synced during the train today? [21:35:26] jeena: try /srv/mediawiki-staging/php-1.37.0-wmf.9$ git log -p HEAD..@{u} in deploy1001 [21:35:38] it's not rebased, meaning it's not in the files yet [21:35:45] so sync wouldn't deploy it [21:36:23] I can deploy it alongside my other patch, if you think it's okay [21:36:46] yeah, it should be okay [21:36:59] (03PS1) 10Dzahn: Revert "deployment::rsync:: temp allow dumping files from miscweb1002" [puppet] - 10https://gerrit.wikimedia.org/r/699017 [21:37:01] just trying to figure out how it happened [21:38:07] (03CR) 10Dzahn: [C: 03+2] "This won't hurt to be there even before VMs exist." [puppet] - 10https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [21:38:34] I guess a git pull was missed somewhere [21:38:36] (Shameless promotion) I use https://deploy-commands.toolforge.org/bacc/698681 to reduce chance of making mistakes [21:38:44] :) [21:38:59] I used to make such mistakes in every other deployment :D [21:40:14] your deploy-commands thing is pretty cool! 
[21:40:43] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.9/includes/language/LanguageConverter.php: Backport: [[gerrit:699014|Revert "Add type hint to constructor of LanguageConverter" (T284685)]] (duration: 01m 24s) [21:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:47] T284685: TypeError: Argument 1 passed to LanguageConverter::__construct() must be an instance of Language, instance of StubUserLang given, called in /srv/mediawiki/php-1.37.0-wmf.9/includes/language/LanguageConverterFactory.php on line 132 - https://phabricator.wikimedia.org/T284685 [21:42:32] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/DiscussionTools/modules/dt-ve/CommentTargetWidget.less: Backport: [[gerrit:698681|Update surface styles for VE changes (T284567)]] (duration: 01m 14s) [21:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:36] T284567: [regression] Reply tool padding and height broken - https://phabricator.wikimedia.org/T284567 [21:42:47] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh1001.wikimedia.org [21:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:42] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (Dzahn) dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 10 --network public eqiad_C doh1001 Ready to create Ganeti VM doh1001... 
[21:51:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh1001.wikimedia.org [21:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:45] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (Dzahn) dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 10 --network public eqiad_D doh1002 Ready to create Ganeti VM doh1002... [21:53:49] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh1002.wikimedia.org [21:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:27] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) for new host doh1002.wikimedia.org [21:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:51] mutante: _/\_ [22:03:45] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh1002.wikimedia.org [22:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:00] everything worked so far sukhe, ignore exit_code=97 above, that was user error [22:05:11] aborted cookbook without meaning to [22:05:17] and then had to delete IPs in netbox [22:10:52] oh interesting [22:12:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh1002.wikimedia.org [22:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:04] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:50:00] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - 
Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:50:03] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284709 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [22:50:03] aid_Information_Gathering [22:50:07] SRE, ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284709 (ops-monitoring-bot) [22:58:34] (CR) Dzahn: [C: +2] Revert "deployment::rsync:: temp allow dumping files from miscweb1002" [puppet] - https://gerrit.wikimedia.org/r/699017 (owner: Dzahn) [23:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210609T2300). Please do the needful. [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:05:11] sukhe: I could not upload the patch for DHCP... [23:05:27] MAC: doh1001 aa:00:00:88:47:67 [23:05:40] MAC: doh1002 aa:00:00:4d:f8:15 [23:06:00] I gotta go afk for now, if you want to create that go ahead.. otherwise I do it later [23:06:17] process needed is the same as the other day [23:06:30] VMs exist but no OS [23:07:40] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (Dzahn) Here are the MAC addresses, but we still need the DHCP change, I could not upload right now: ` 168 host doh1001 { 169 hardware ethernet aa:00:00:88:47:67; 170... 
[23:11:39] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:26:01] SRE, SRE-Access-Requests: Need to ssh with my new laptop - https://phabricator.wikimedia.org/T284588 (MMiller_WMF) Open→Resolved a:MMiller_WMF The suggestions about my passphrase stored in Keychain worked! Thank you, @RLazarus, @RhinosF1, and @Joe!