[00:02:34] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:47] (03PS1) 10Legoktm: Add shellbox-constraints to LVS [puppet] - 10https://gerrit.wikimedia.org/r/709566 (https://phabricator.wikimedia.org/T285104) [00:15:49] (03PS1) 10Legoktm: service: Switch shellbox-constraints to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/709567 (https://phabricator.wikimedia.org/T285104) [00:15:51] (03PS1) 10Legoktm: service: Switch shellbox-constraints to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/709568 (https://phabricator.wikimedia.org/T285104) [00:15:53] (03PS1) 10Legoktm: service: Switch shellbox-constraints to production [puppet] - 10https://gerrit.wikimedia.org/r/709569 (https://phabricator.wikimedia.org/T285104) [00:16:03] (03PS1) 10Legoktm: Add shellbox-constraints.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709571 (https://phabricator.wikimedia.org/T285104) [00:16:05] (03PS1) 10Legoktm: Add shellbox-constraints to discovery [dns] - 10https://gerrit.wikimedia.org/r/709572 (https://phabricator.wikimedia.org/T285104) [00:22:39] !log reedy@deploy1002 Started deploy [integration/docroot@3cff0e4]: (no justification provided) [00:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:47] !log reedy@deploy1002 Finished deploy [integration/docroot@3cff0e4]: (no justification provided) (duration: 00m 08s) [00:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:34] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[00:26:47] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bawolff) >>! In T287380#7251388, @ssr wrote: > Please enable DPL at least at Main Page of... [00:28:06] (03PS2) 10BryanDavis: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) [00:29:08] !log reedy@deploy1002 Started deploy [integration/docroot@f7df1c7]: (no justification provided) [00:29:13] !log reedy@deploy1002 Finished deploy [integration/docroot@f7df1c7]: (no justification provided) (duration: 00m 05s) [00:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:52] !log reedy@deploy1002 Started deploy [integration/docroot@f9d225d]: with less gref [00:43:58] !log reedy@deploy1002 Finished deploy [integration/docroot@f9d225d]: with less gref (duration: 00m 05s) [00:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:28] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10sgrabarczuk) a:03sgrabarczuk [01:02:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:21] (03CR) 10BryanDavis: "Preliminary testing done using local-charts and minikube. 
See https://phabricator.wikimedia.org/P16938 for the values.yaml that I used for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [01:10:20] RECOVERY - dump of s6 in eqiad on alert1001 is OK: Last dump for s6 at eqiad (db1140.eqiad.wmnet:3316) taken on 2021-08-03 00:00:02 (109 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [01:25:44] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T0200) [02:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.17 [core] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709578 [02:06:54] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.17 [core] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709578 (owner: 10TrainBranchBot) [02:07:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-import-siteinfo-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:28] 10SRE, 10Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) [02:26:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.17 [core] 
(wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709578 (owner: 10TrainBranchBot) [02:43:35] 10SRE, 10serviceops, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10sgrabarczuk) - (1) From my perspective, the switchover went smoothly. Most tasks were well documented and automated. I know of no serious consequenc... [02:45:08] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10sgrabarczuk) I've added my two cents in T285806#7254431. These dates work for us. [03:01:16] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:08] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:45] (03PS1) 10Legoktm: Remove DynamicPageList from all Wikimania wikis except 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709585 (https://phabricator.wikimedia.org/T287916) [04:01:48] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:36] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:41] !log Disable puppet on dbproxy1014 dbproxy1013 dbproxy1020 [04:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:16] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:10] (03PS1) 
10Marostegui: wmnet: Failover m1, m2 and m3-master [dns] - 10https://gerrit.wikimedia.org/r/709591 (https://phabricator.wikimedia.org/T287574) [05:04:07] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/709591 (https://phabricator.wikimedia.org/T287574) (owner: 10Marostegui) [05:26:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:41] (03PS3) 10KartikMistry: Update cxserver to 2021-08-02-164000-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/709025 (https://phabricator.wikimedia.org/T286473) [05:57:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 6 (dbprov1001, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:01:20] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:32] * kart__ updating cxserver [06:11:28] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2021-08-02-164000-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/709025 (https://phabricator.wikimedia.org/T286473) (owner: 10KartikMistry) [06:13:59] (03Merged) 10jenkins-bot: Update cxserver to 2021-08-02-164000-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/709025 (https://phabricator.wikimedia.org/T286473) (owner: 10KartikMistry) [06:15:38] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [06:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:32] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[06:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:05] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [06:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:10] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:48] !log Updated cxserver to 2021-08-02-164000-production (T286473) [06:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:56] T286473: Generate template parameter alignments for additional wikis - https://phabricator.wikimedia.org/T286473 [06:33:24] (03PS1) 10Jcrespo: dbbackups: Add s4 to db1139, eqiad backup source [puppet] - 10https://gerrit.wikimedia.org/r/709636 (https://phabricator.wikimedia.org/T280979) [06:34:16] (03PS3) 10Jcrespo: dbbackups: Move s4 from db1145 to db1139 and reimage db1145 to buster [puppet] - 10https://gerrit.wikimedia.org/r/709395 (https://phabricator.wikimedia.org/T280979) [06:34:48] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add s4 to db1139, eqiad backup source [puppet] - 10https://gerrit.wikimedia.org/r/709636 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [06:39:58] PROBLEM - puppet last run on dragonfly-supernode1001 is CRITICAL: CRITICAL: Puppet has been disabled for 605026 seconds, message: dragonfly tests T286054 - jayme, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:01:11] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:43] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - 
https://phabricator.wikimedia.org/T287852 (10Marostegui) db1183 is now up and replicating from db1107 [07:10:35] RECOVERY - puppet last run on dragonfly-supernode1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:19:09] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) 05Resolved→03Open Reopening, the 'kibana7' service is marked as being setup in puppet (e.g. its alerts don't page) but it is fully in production [07:19:46] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) [07:24:57] (03PS4) 10Jcrespo: dbbackups: Move s4 from db1145 to db1139 and reimage db1145 to buster [puppet] - 10https://gerrit.wikimedia.org/r/709395 (https://phabricator.wikimedia.org/T280979) [07:26:09] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:41] PROBLEM - High average POST latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [07:32:57] we have some anomaly on api servers only (traffic related?) [07:33:07] (03CR) 10Jelto: [V: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:33:49] 301s got 10x [07:34:07] at app layer [07:35:58] but only for api right? 
[07:36:10] jayme/mutante can you please take a look? ^^ [07:37:00] joe: wilco [07:37:49] I suspect it has zero to do with appservers and everything to do with one database or some traffic pattern, but that surge in POST times is worrisome [07:39:12] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:39:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:42:51] !log upgrading spicerack on cumin2002 to 0.0.57 [07:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:57] (03CR) 10Ema: [C: 03+1] varnish: Improve comments around maps access, retire T261694 [puppet] - 10https://gerrit.wikimedia.org/r/709511 (https://phabricator.wikimedia.org/T261694) (owner: 10Legoktm) [07:45:40] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) The package works fine on Buster, though Stretch is trickier (prometheus codfw/eqiad are stretch) because of missing or old dependencies, namely: * libjs... [07:56:42] 10SRE, 10observability, 10good first task: mtail testing infrastructure prints python deprecation warnings - https://phabricator.wikimedia.org/T285534 (10ema) 05Open→03Resolved a:03ema Thanks @fgiunchedi, I just verified that this is now fixed. [07:57:07] hey folks what is the status? Need help? [07:57:23] (03PS1) 10Dzahn: site/conftool: convert 4 jobrunners to appservers and vice versa for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [07:57:52] I'm a bit slow with stuff like that. 
So what I see is a bunch of wikidata 301 with UA being python-requests/2.25.1 [07:57:58] elukey: ^ [07:58:05] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: convert 4 jobrunners to appservers and vice versa for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [07:58:28] (03PS2) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [07:58:34] all(?) from 141.211.192.74 [07:58:47] but that's from what I see on one apiserver only [07:59:01] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [07:59:57] (03PS1) 10Muehlenhoff: Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 [08:01:39] (03PS2) 10Muehlenhoff: Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 [08:01:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [08:03:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [08:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:18] (03CR) 10jerkins-bot: [V: 04-1] Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 (owner: 10Muehlenhoff) [08:10:47] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1145.eqiad.wmnet with reason: REIMAGE [08:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:04] !log 
jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1145.eqiad.wmnet with reason: REIMAGE [08:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:03] !log pausing refreshLinks run against wikiversities while other issues are figured out [08:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:20] (03PS2) 10David Caro: wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) [08:26:12] RECOVERY - High average POST latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [08:28:46] (03CR) 10David Caro: [C: 04-1] "> Patch Set 1: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/709482 (owner: 10David Caro) [08:32:48] (03PS3) 10Muehlenhoff: Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 [08:33:29] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move s4 from db1145 to db1139 and reimage db1145 to buster [puppet] - 10https://gerrit.wikimedia.org/r/709395 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [08:35:33] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) Since the codfw/eqiad Prometheus hosts are going to be replaced with new HW in Q2, I'm going to force-install prometheus on the stretch hosts for now. It... 
[08:42:25] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 (owner: 10Muehlenhoff) [08:47:14] (03PS4) 10Muehlenhoff: Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 [08:47:27] (03CR) 10Muehlenhoff: Support test cluster in get_locations() (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 (owner: 10Muehlenhoff) [08:49:29] (03CR) 10JMeybohm: [C: 03+2] Add debian directory [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:49:35] (03CR) 10JMeybohm: [C: 03+2] Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:50:10] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1145.eqiad.wmnet with reason: REIMAGE [08:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:31] (03Merged) 10jenkins-bot: Add debian directory [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:52:33] (03Merged) 10jenkins-bot: Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:53:07] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1145.eqiad.wmnet with reason: REIMAGE [08:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:58] (03CR) 10Filippo Giunchedi: am: match the team regexes on instance names too (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 (owner: 10David Caro) [08:54:58] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - 
https://phabricator.wikimedia.org/T222113 (10MoritzMuehlenhoff) >>! In T222113#7254828, @fgiunchedi wrote: > Since the codfw/eqiad Prometheus hosts are going to be replaced with new HW in Q2, I'm going to force... [08:57:19] !log installing pillow security updates on stretch [08:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] marostegui dcaro I'll start with the plan at https://phabricator.wikimedia.org/T287574#7254834 if that works ? [09:01:13] godog: +1 [09:01:44] ok! going ahead [09:02:07] (03CR) 10Filippo Giunchedi: [C: 03+2] haproxy: read config directory natively [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [09:02:12] (03CR) 10Filippo Giunchedi: [C: 03+2] haproxy: bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/708105 (owner: 10Filippo Giunchedi) [09:02:39] testing on thumbor1001 [09:03:47] godog:+1] [09:04:12] godog: dbproxy2* can be done anytime, I mean reenabling+testing puppet [09:06:50] (03CR) 10David Caro: am: match the team regexes on instance names too (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 (owner: 10David Caro) [09:06:52] ack, change works fine on thumbor so I'll reenable puppet there [09:07:08] (03PS3) 10David Caro: wmcs.puppet_alert: Add failed resources to the email [puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) [09:07:10] (03PS2) 10David Caro: wmcs.cloud-init: create a ready file for alerts [puppet] - 10https://gerrit.wikimedia.org/r/709482 [09:07:12] (03PS3) 10David Caro: wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) [09:07:24] godog: ok [09:07:54] godog: I am going to test dbproxy2001 [09:08:02] marostegui: ack sounds good [09:08:16] there won't be a delay in restarting haproxy like it used to be btw [09:08:34] so afaict restart is pretty instantaneous 
now [09:08:42] yeah, I am seeing that now, that's nice [09:09:08] dbproxy2001 worked fine, can you enable puppet on 2002 and 2003 if you have the command line handy? [09:09:15] marostegui: I will yeah [09:09:34] I think it is pretty safe to do https://phabricator.wikimedia.org/T287574#7248065 those too but NOT 1014, 1013 and 1020 [09:09:35] marostegui: also dbproxy2004 ? [09:09:44] godog: that one doesn't have haproxy running [09:09:59] oh ok, my bad [09:10:05] anyways {{done}} [09:10:13] the above ones too? [09:10:25] no sorry only dbproxy2* [09:10:29] ah ok [09:10:44] let me know when the above are done, so I can double check the standby ones and if all is good, failover the active ones [09:11:03] ok doing so [09:11:32] marostegui: to avoid misunderstandings, I enabled puppet but did not run it on dbproxy2* [09:11:50] !log importing dragonfly 1.0.6-2 to buster-wikimedia and stretch-wikimedia - T286054 [09:11:51] godog: that's ok, we can let it run whenever it is time for it to run [09:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:58] T286054: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 [09:12:18] !log installing php 7.0 security updates on stretch [09:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:19] marostegui: ok all dbproxy reenabled but not 1013/1014/1020 [09:13:39] godog: great, checking all the standby hosts and if ok, I will do the failover [09:13:51] godog: puppet run? [09:14:02] marostegui: no just reenabled, I can run puppet too [09:14:08] please do so [09:14:31] ok! 
[09:17:02] marostegui: ok puppet has run [09:17:14] godog: ok, let me check if they are all fine [09:18:21] godog: all good, going to push the dns change [09:18:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1, m2 and m3-master [dns] - 10https://gerrit.wikimedia.org/r/709591 (https://phabricator.wikimedia.org/T287574) (owner: 10Marostegui) [09:18:32] ok [09:18:56] (03PS17) 10Ema: wmflib::role_hosts: new function return list of hosts running a role [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [09:18:57] !log Failover m1, m2 and m3-master T287574 [09:18:58] (03PS1) 10Ema: cache: use wmflib::role_hosts instead of cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/709645 (https://phabricator.wikimedia.org/T282880) [09:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:04] T287574: Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 [09:19:34] godog: Pushed, the TTL is 5 minutes, but let's give it 15-20 for the connections to move smoothly, if you are done from your side. Once it is moved, I can enable+run puppet on the pending dbproxy* and close the task [09:19:50] marostegui: SGTM [09:20:26] dcaro: cloudcontrol hosts have puppet disabled ATM, ok to reenable and run puppet to apply the change ? [09:21:03] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30461/console" [puppet] - 10https://gerrit.wikimedia.org/r/709645 (https://phabricator.wikimedia.org/T282880) (owner: 10Ema) [09:24:31] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) >>! In T222113#7254861, @MoritzMuehlenhoff wrote: >>>! In T222113#7254828, @fgiunchedi wrote: >> Since the codfw/eqiad Prometheus hosts are going to be r... 
[09:27:27] godog: yep, go ahead [09:28:37] (03PS1) 10Jcrespo: dbbackups: Reenable notifications after db1145, db1139 reorg [puppet] - 10https://gerrit.wikimedia.org/r/709667 (https://phabricator.wikimedia.org/T280979) [09:28:40] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 (owner: 10Muehlenhoff) [09:28:44] dcaro: ok doing, thanks [09:33:54] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10hnowlan) No objection for sockpuppet, thanks! [09:34:00] (03CR) 10Ema: [C: 04-1] "Thank you so much for this! I tried using the function here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709645/" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [09:36:47] (03PS1) 10MMandere: modules: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) [09:37:53] ok cloudcontrol all done [09:38:39] dcaro: ^ [09:39:24] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) Thank you all for the fast replies! 
[09:45:29] (03PS5) 10Muehlenhoff: Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 [09:47:34] marostegui: will be afk for 10m [09:47:48] godog: no problem I am going to reenable puppet in like 5-10m [09:48:01] the failover is ok, just giving it extra time for some persistent connections [09:49:40] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications after db1145, db1139 reorg [puppet] - 10https://gerrit.wikimedia.org/r/709667 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [09:50:51] (03CR) 10Muehlenhoff: [C: 03+2] Support test cluster in get_locations() [cookbooks] - 10https://gerrit.wikimedia.org/r/709640 (owner: 10Muehlenhoff) [09:52:04] godog:\o/ [09:55:27] godog: all done, I am closing the task [09:55:50] marostegui: SGTM, thank you for your help cc dcaro [09:56:01] 👍 [09:57:55] (03CR) 10Filippo Giunchedi: am: match the team regexes on instance names too (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 (owner: 10David Caro) [09:58:33] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) [10:10:28] (03PS1) 10David Caro: prometheus.icinga_exporter: Use per-label regexes on team labels [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 [10:12:35] (03Abandoned) 10David Caro: am: match the team regexes on instance names too [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 (owner: 10David Caro) [10:24:06] (03PS1) 10Elukey: sre.ores.roll-restart-workers: move to LBConfig [cookbooks] - 10https://gerrit.wikimedia.org/r/709672 [10:28:19] (03PS1) 10Marostegui: dbproxy1013,dbproxy1015: Promote db1183 to master [puppet] - 10https://gerrit.wikimedia.org/r/709673 (https://phabricator.wikimedia.org/T287852) [10:28:41] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 
10https://gerrit.wikimedia.org/r/709673 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [10:29:11] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [10:49:41] (03CR) 10Vgutierrez: [C: 03+1] Add shellbox-constraints to LVS [puppet] - 10https://gerrit.wikimedia.org/r/709566 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [10:58:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [10:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:18] (03CR) 10Jelto: [V: 03+1 C: 03+2] hiera::role::common::idp add gitlab-replica to production idp [puppet] - 10https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [11:00:19] ok :) [11:01:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2001.codfw.wmnet [11:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:02] !log rename Ganeti group for test cluster to row_D T286206 [11:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:09] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [11:15:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:32] !log upgrade prometheus5001 to 2.24.1+ds-1+wmf1 - T222113 [11:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:40] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [11:19:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:09] (03PS5) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [11:23:26] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30463/console" [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:25:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [11:25:47] (03CR) 10Volans: [C: 03+1] "LGTM, caveat inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709672 (owner: 10Elukey) [11:25:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:07] (03CR) 10Volans: [C: 03+1] "LGTM, amended caveat" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709672 (owner: 10Elukey) [11:28:11] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global [11:28:15] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin+prometheus/ops [11:28:19] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [11:28:22] !log upgrade prometheus3001 to 2.24.1+ds-1+wmf1 - T222113 [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:29] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [11:28:54] yeah the prometheus restarts alerts are expected [11:30:57] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global 
[11:31:09] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [11:36:39] (03CR) 10Elukey: [C: 03+2] "Thanks a lot for the suggestions!" [cookbooks] - 10https://gerrit.wikimedia.org/r/709672 (owner: 10Elukey) [11:36:45] !log updated bullseye d-i images to rc3 T275873 [11:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:52] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [11:41:18] 10SRE, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10MoritzMuehlenhoff) Noticed this during clinic duty: @ssingh If the decom cookbook ran on the host, you can can tick off the relevant parts under "Steps for service owner" and reassig... [11:42:23] (03PS1) 10Elukey: sre.ores.roll-restart: fix usage of LBRemoteCluster [cookbooks] - 10https://gerrit.wikimedia.org/r/709679 [11:46:25] (03CR) 10Elukey: [C: 03+2] sre.ores.roll-restart: fix usage of LBRemoteCluster [cookbooks] - 10https://gerrit.wikimedia.org/r/709679 (owner: 10Elukey) [11:48:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2001.codfw.wmnet [11:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:13] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqsin+prometheus/ops [11:51:16] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled on imposm hosts. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:31] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global [11:58:39] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [11:59:23] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global [11:59:33] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [12:01:05] 10SRE, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10ssingh) a:03Jclark-ctr [12:05:18] !log installing libgcrypt20 security updates [12:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:27] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) [12:21:31] (03CR) 10Ema: [C: 03+1] trafficserver::text: open mwdebug on k8s again [puppet] - 10https://gerrit.wikimedia.org/r/709392 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [12:22:36] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10klausman) I did an analysis of the ATS and Varnish Kafka topics as reported for `cp3050.esams.wmnet` (the only host that currently feeds... 
[12:35:16] (03CR) 10Daimona Eaytoy: [C: 03+1] Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 (owner: 10Legoktm) [12:45:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver::text: open mwdebug on k8s again [puppet] - 10https://gerrit.wikimedia.org/r/709392 (https://phabricator.wikimedia.org/T283056) (owner: 10Giuseppe Lavagetto) [12:47:18] !log restarting Tomcat on idp1001 [12:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:53] (03PS3) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709053 [12:47:55] (03CR) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709053 (owner: 10David Caro) [12:47:57] (03PS5) 10David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 [12:51:22] (03PS2) 10David Caro: prometheus: added some wmcs team label configs [puppet] - 10https://gerrit.wikimedia.org/r/709471 [12:52:55] (03CR) 10David Caro: [C: 03+2] wmcs.puppet_alert: Add failed resources to the email [puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [13:01:03] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:06:51] (03PS1) 10Elukey: Expose Spicerack's Cumin config [software/spicerack] - 10https://gerrit.wikimedia.org/r/709691 [13:14:15] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10Ottomata) 10-20 seconds / 0.02% missing seems acceptable to me. Perhaps this is enough verification to proceed? 
[13:14:41] (03PS1) 10MVernon: Correct documented path of wmf-update-ssh-config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709693 [13:14:43] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709693 (owner: 10MVernon) [13:19:58] (03PS1) 10Btullis: Add a CNAME entry for analytics-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709695 (https://phabricator.wikimedia.org/T273642) [13:23:19] (03CR) 10MVernon: "Hi," [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709693 (owner: 10MVernon) [13:25:26] (03Abandoned) 10Elukey: Expose Spicerack's Cumin config [software/spicerack] - 10https://gerrit.wikimedia.org/r/709691 (owner: 10Elukey) [13:32:38] (03PS1) 10Elukey: sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 [13:33:46] still not ok --^ [13:37:49] (03PS1) 10Volans: Add new codfw_test ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/709701 [13:38:31] (03CR) 10Volans: Add new codfw_test ganeti cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [13:40:29] (03PS1) 10JMeybohm: dragonfly: Enable metric scraping for dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/709703 (https://phabricator.wikimedia.org/T286054) [13:40:32] (03PS1) 10JMeybohm: prometheus::ops: Scrape metrics from dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/709704 (https://phabricator.wikimedia.org/T286054) [13:43:08] (03PS1) 10Kormat: .mailmap: Map old email address to correct name/email. [puppet] - 10https://gerrit.wikimedia.org/r/709705 [13:43:45] (03PS1) 10Volans: sre.ganeti.makevm: make error message more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/709706 [13:43:53] (03CR) 10Kormat: [C: 03+2] .mailmap: Map old email address to correct name/email. 
[puppet] - 10https://gerrit.wikimedia.org/r/709705 (owner: 10Kormat) [13:44:38] (03PS2) 10Elukey: sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 [13:45:28] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10BTullis) I'm trying to get my head around what the implications of these two statements are: > usually, there are 0.02% of events that ar... [13:47:40] (03CR) 10Volans: "LGTM, question inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 (owner: 10Elukey) [13:47:47] (03CR) 10Volans: [C: 03+1] sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 (owner: 10Elukey) [13:49:51] (03PS3) 10Elukey: sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 [13:50:06] (03CR) 10Elukey: sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 (owner: 10Elukey) [13:51:39] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10Ottomata) I think thats right! 
[13:55:55] (03CR) 10Elukey: [C: 03+2] sre.ores.roll-restart-workers: fix usage of LBRemoteConfig - part 2 [cookbooks] - 10https://gerrit.wikimedia.org/r/709699 (owner: 10Elukey) [13:57:41] (03PS1) 10DCausse: helpers: do not repeat ports section for kafka brokers egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 [13:59:26] (03PS1) 10Btullis: Update Presto TLS configuration in production [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [14:00:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709703 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:01:43] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:03:46] 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10LSobanski) [14:04:46] (03CR) 10Andrew Bogott: [C: 03+1] "I don't feel all that strongly about the name, although I think it makes sense to be specific since a server goes through many different s" [puppet] - 10https://gerrit.wikimedia.org/r/709482 (owner: 10David Caro) [14:05:15] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [14:06:25] (03PS2) 10Herron: elk7: change kibana7 monitoring to critical [puppet] - 10https://gerrit.wikimedia.org/r/655957 (https://phabricator.wikimedia.org/T234854) [14:08:04] (03CR) 10Herron: [C: 03+2] elk7: change kibana7 monitoring to critical [puppet] - 10https://gerrit.wikimedia.org/r/655957 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [14:09:24] (03PS2) 10Herron: dns: remove logstash-next.wikimedia.org record [dns] - 
10https://gerrit.wikimedia.org/r/655959 (https://phabricator.wikimedia.org/T234854) [14:10:26] (03CR) 10Muehlenhoff: Add new codfw_test ganeti cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [14:10:47] (03PS1) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [14:10:49] (03CR) 10Herron: [C: 03+2] dns: remove logstash-next.wikimedia.org record [dns] - 10https://gerrit.wikimedia.org/r/655959 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [14:11:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/709706 (owner: 10Volans) [14:11:26] (03CR) 10Volans: "The uncommented changes looks good to me, for the others I'm not sure or needs a change I've left a comment." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:11:47] (03PS1) 10Giuseppe Lavagetto: mwdebug: sync eqiad and codfw, drop node confinemet [deployment-charts] - 10https://gerrit.wikimedia.org/r/709718 [14:12:32] (03PS2) 10Herron: elk7: remove logstash-next cache setting [puppet] - 10https://gerrit.wikimedia.org/r/655958 (https://phabricator.wikimedia.org/T234854) [14:13:16] !log chown dumpsgen and chmod 644 dumpsdata1003:/data/xmldatadumps/public/lezwiki/20210801/dumpstatus.json (it was only readable by root causing an analytics import job to fail), ping apergos [14:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:30] (03CR) 10Herron: [C: 03+2] elk7: remove logstash-next cache setting [puppet] - 10https://gerrit.wikimedia.org/r/655958 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [14:14:25] (03PS1) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 
(https://phabricator.wikimedia.org/T286054) [14:14:54] (03CR) 10jerkins-bot: [V: 04-1] Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:16:56] (03CR) 10David Caro: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/709482 (owner: 10David Caro) [14:17:02] (03CR) 10David Caro: [C: 03+2] wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [14:17:43] (03PS3) 10David Caro: wmcs.cloud-init: create a ready file for alerts [puppet] - 10https://gerrit.wikimedia.org/r/709482 [14:17:51] (03PS4) 10David Caro: wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) [14:18:20] (03PS5) 10David Caro: wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) [14:18:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add a temporary role for appservers plus docker and dragonfly (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:18:45] (03PS2) 10Herron: kibana7: remove kibana-next conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654438 (https://phabricator.wikimedia.org/T234854) [14:19:08] 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10MoritzMuehlenhoff) How about we create a mechanism similar to the logout.d scripts, but for decom? Let's say we create a new... 
[14:19:34] 10SRE, 10Analytics: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10elukey) [14:20:04] (03CR) 10Muehlenhoff: Add new codfw_test ganeti cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [14:20:35] (03CR) 10JMeybohm: [C: 03+1] "Cool!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709718 (owner: 10Giuseppe Lavagetto) [14:21:01] (03CR) 10Herron: [C: 03+2] kibana7: remove kibana-next conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654438 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [14:21:36] (03PS2) 10Volans: Add new codfw_test ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/709701 [14:21:59] (03CR) 10Volans: "addressed comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [14:22:17] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: make error message more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/709706 (owner: 10Volans) [14:22:20] (03PS2) 10Volans: sre.ganeti.makevm: make error message more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/709706 [14:22:46] (03Abandoned) 10Herron: logstash-next: change backend naming from kibana-next to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [14:22:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [14:23:16] (03CR) 10David Caro: [C: 03+2] wmcs.puppet_alert: Don't fail if the host is not ready (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [14:23:40] !log chown dumpsgen and chmod 644 /data/xmldatadumps/public/lezwiki/20210801/dumpstatus.json on labstore1006 and labstore1007 (it was only readable by root causing an analytics import job to fail), ping apergos [14:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:37] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [cookbooks] - 10https://gerrit.wikimedia.org/r/709720 [14:26:13] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) 05Open→03Resolved alerting and several other cleanup patches merged [14:26:56] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [labs/private] - 10https://gerrit.wikimedia.org/r/709721 [14:27:28] !log chown dumpsgen and chmod 644 /data/xmldatadumps/public/*/20210801/dumpstatus.json on labstore1006 and labstore1007 (it was only readable by root causing an analytics import job to fail), ping apergos [14:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: sync eqiad and codfw, drop node confinemet [deployment-charts] - 10https://gerrit.wikimedia.org/r/709718 (owner: 10Giuseppe Lavagetto) [14:30:10] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/709722 [14:30:41] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709723 [14:30:58] 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10Volans) @MoritzMuehlenhoff 's proposal is certainly a neat option but I have a couple of worries, namely: - it might be hard... [14:31:04] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [dns] - 10https://gerrit.wikimedia.org/r/709724 [14:31:14] (03CR) 10Volans: [C: 03+2] Add new codfw_test ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/709701 (owner: 10Volans) [14:31:42] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709725 [14:32:58] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:33:03] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:20] (03PS2) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [14:33:48] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [software/transferpy] - 10https://gerrit.wikimedia.org/r/709727 [14:33:50] (03CR) 10jerkins-bot: [V: 04-1] Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:34:19] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/709728 [14:34:38] (03PS3) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [14:34:42] (03CR) 10Muehlenhoff: modules: Add drmrs DC site (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:35:08] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/709729 [14:35:11] (03CR) 10jerkins-bot: [V: 04-1] Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:35:23] PROBLEM - Check systemd state on backup1007 is CRITICAL: CRITICAL - degraded: The following units failed: proc-sys-fs-binfmt_misc.automount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:30] (03PS1) 10Kormat: mailmap: Add mapping for my name/email address. [software] - 10https://gerrit.wikimedia.org/r/709730 [14:35:46] checking backup1007 [14:36:11] (03PS2) 10Kormat: mailmap: Add mapping for my name/email address. [dns] - 10https://gerrit.wikimedia.org/r/709724 [14:37:16] (03PS2) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [14:37:17] RECOVERY - Check systemd state on backup1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:43] 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10Kormat) >>! In T287954#7255698, @Volans wrote: > - some "decom" actions should be performed from a central host instead of t... 
[14:44:06] (03CR) 10Kormat: [C: 03+1] dbproxy1013,dbproxy1015: Promote db1183 to master [puppet] - 10https://gerrit.wikimedia.org/r/709673 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [14:44:53] (03PS3) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [14:45:45] (03PS1) 10Herron: logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) [14:45:47] (03PS1) 10Herron: logstash: add logstash203[345] to codfw elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709732 (https://phabricator.wikimedia.org/T287938) [14:46:05] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30466/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [14:48:19] (03PS2) 10Herron: logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) [14:49:48] (03PS1) 10MVernon: Add new Data Persistence SRE Matthew Vernon / mvernon to ops [puppet] - 10https://gerrit.wikimedia.org/r/709733 [14:49:58] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [14:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] (03CR) 10DCausse: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:51:59] (03CR) 10jerkins-bot: [V: 04-1] eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:53:58] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [14:54:10] (03PS4) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [14:56:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2001.codfw.wmnet [14:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codfw.wmnet (**FAIL**) - **Host steps raised exc... [15:01:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet [15:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:55] (03CR) 10Kormat: [C: 04-1] Add new Data Persistence SRE Matthew Vernon / mvernon to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709733 (owner: 10MVernon) [15:04:28] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [15:04:56] (03CR) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:05:27] (03CR) 10Kormat: [V: 03+2 C: 03+2] mailmap: Add mapping for my name/email address. [labs/private] - 10https://gerrit.wikimedia.org/r/709721 (owner: 10Kormat) [15:05:41] (03CR) 10Kormat: [V: 03+2 C: 03+2] mailmap: Add mapping for my name/email address. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/709722 (owner: 10Kormat) [15:06:00] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/709720 (owner: 10Kormat) [15:06:10] (03CR) 10Kormat: [V: 03+2 C: 03+2] mailmap: Add mapping for my name/email address. [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709723 (owner: 10Kormat) [15:07:15] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [software] - 10https://gerrit.wikimedia.org/r/709730 (owner: 10Kormat) [15:07:24] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/709729 (owner: 10Kormat) [15:10:07] (03CR) 10jerkins-bot: [V: 04-1] mailmap: Add mapping for my name/email address. [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [15:12:28] (03PS1) 10DCausse: rdf-streaming-updater: Add explicit egress rules for kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/709735 [15:12:42] (03PS5) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [15:13:29] (03CR) 10Volans: [C: 03+1] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [15:13:41] (03Merged) 10jenkins-bot: mailmap: Add mapping for my name/email address. [cookbooks] - 10https://gerrit.wikimedia.org/r/709720 (owner: 10Kormat) [15:13:43] (03Merged) 10jenkins-bot: mailmap: Add mapping for my name/email address. [software] - 10https://gerrit.wikimedia.org/r/709730 (owner: 10Kormat) [15:14:01] (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [15:14:03] (03PS2) 10MVernon: admin: Add new Data Persistence SRE Matthew Vernon / mvernon to ops [puppet] - 10https://gerrit.wikimedia.org/r/709733 [15:14:09] (03Merged) 10jenkins-bot: mailmap: Add mapping for my name/email address. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/709729 (owner: 10Kormat) [15:14:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2002.codfw.wmnet [15:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:53] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:41] (03PS6) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [15:16:44] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10elukey) Something that I noticed, that may be totally off: ` scala> spark.sql("SELECT count(*) FROM wmf.webrequest where webrequest_sourc... [15:16:46] (03PS1) 10Muehlenhoff: Add DHCP entry for testvm2002, running on ganeti-test01 [puppet] - 10https://gerrit.wikimedia.org/r/709736 [15:17:09] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709733 (owner: 10MVernon) [15:18:03] (03PS1) 10Btullis: Add presto keytabs to the cluster coordinator replica role [puppet] - 10https://gerrit.wikimedia.org/r/709737 (https://phabricator.wikimedia.org/T273642) [15:18:11] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [dns] - 10https://gerrit.wikimedia.org/r/709724 (owner: 10Kormat) [15:18:42] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/709728 (owner: 10Kormat) [15:18:57] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/709727 (owner: 10Kormat) [15:19:13] (03PS1) 10Herron: kibana7: switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/709738 [15:19:19] (03CR) 10Kormat: [C: 03+2] mailmap: Add mapping for my name/email address. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709725 (owner: 10Kormat) [15:19:30] (03CR) 10MVernon: [C: 03+2] admin: Add new Data Persistence SRE Matthew Vernon / mvernon to ops [puppet] - 10https://gerrit.wikimedia.org/r/709733 (owner: 10MVernon) [15:19:41] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Add explicit egress rules for kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/709735 (owner: 10DCausse) [15:20:42] (03Merged) 10jenkins-bot: mailmap: Add mapping for my name/email address. [software/spicerack] - 10https://gerrit.wikimedia.org/r/709726 (owner: 10Kormat) [15:20:48] kormat: Remember to sync your prod MW-config change (or I can). [15:21:32] (03PS7) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [15:21:52] 10Puppet, 10Infrastructure-Foundations, 10SRE Observability, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10lmata) [15:22:27] (03Merged) 10jenkins-bot: rdf-streaming-updater: Add explicit egress rules for kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/709735 (owner: 10DCausse) [15:23:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30470/console" [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:23:04] James_F: changes to .mailmap get synced? 👀 [15:23:33] oh. 
well, i've only just added the file [15:23:48] James_F: my assumption would be that this isn't a file that needs to be synced, but icbw [15:24:21] kormat: It disrupts the deployment server and needs to be manually pulled (if not synced). [15:24:26] I'll do it, no worries. :-) [15:24:32] James_F: uff. thanks for catching that! [15:24:56] kormat: Aka MW deployment is a mess. MW-on-k8s will remove this problem, but for now… [15:25:05] !log prune testvm2001 from Ganeti and clean up from Netbox (stuck in some inconsistent state the decom cookbook can't handle) T286206 [15:25:06] (Done.) [15:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:13] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [15:25:15] James_F: noted. and thanks again <3 [15:26:22] Of course. Happy to help. [15:26:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:38] (03PS1) 10JMeybohm: site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) [15:30:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:37] (03Abandoned) 10JMeybohm: Replace all consumers of docker-registry credentials with alias [labs/private] - 10https://gerrit.wikimedia.org/r/699414 (owner: 10JMeybohm) [15:34:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:22] (03PS8) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [15:35:25] (03PS2) 10JMeybohm: site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) [15:36:15] (03PS4) 10Hnowlan: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) [15:36:23] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30471/console" [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:37:50] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30472/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [15:38:00] (03PS1) 10Kormat: admin: Fix old email address mentions. [puppet] - 10https://gerrit.wikimedia.org/r/709741 [15:41:34] (03CR) 10Kormat: [C: 03+2] admin: Fix old email address mentions. [puppet] - 10https://gerrit.wikimedia.org/r/709741 (owner: 10Kormat) [15:49:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:49] (03CR) 10Filippo Giunchedi: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709704 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:50:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana7: switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/709738 (owner: 10Herron) [15:51:27] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add logstash203[345] to codfw elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709732 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [15:51:31] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [15:55:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:53] (03CR) 10Herron: [C: 03+2] kibana7: switch state to production [puppet] - 10https://gerrit.wikimedia.org/r/709738 (owner: 10Herron) [16:00:04] jbond42 and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T1600). [16:00:25] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) We met today and this is the plan forward: 1) use `topicmappr` to create a list o... [16:00:45] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:47] (03CR) 10Ahmon Dancy: [C: 03+2] pom: overwrite gerrit.war [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/709509 (owner: 10Hashar) [16:09:00] (03Merged) 10jenkins-bot: pom: overwrite gerrit.war [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/709509 (owner: 10Hashar) [16:13:19] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10KFrancis) @RLazarus I am confirming the NDA has been signed. Please proceed with the access request. Thanks! [16:14:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [16:18:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:44] (03CR) 10Ahmon Dancy: [C: 03+2] Gerrit 3.3.5 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709501 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [16:25:35] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Cmjohnson) The DIMM has arrived, the server will need to be taken offline for a few minutes do swap the DIMM. [16:26:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Cmjohnson) The DIMM arrived, is it safe to turn the server off and swap the DIMM? 
[16:27:01] (03Merged) 10jenkins-bot: Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [16:27:05] (03Merged) 10jenkins-bot: Gerrit 3.3.5 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709501 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [16:27:56] !log Going to upgrade Gerrit 3.3 (scheduled maintenance) [16:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:37] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10Papaul) [16:32:02] there will be some gerrit alarms cause we cant put it in maintenance mode [16:33:44] 10SRE, 10ops-eqiad, 10User-fgiunchedi: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10Cmjohnson) 05Open→03Resolved The disk has been replaced, the system is now reporting healthy. [16:35:13] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10RLazarus) a:05KFrancis→03MoritzMuehlenhoff Thanks Katie! Assigning this over to this week's clinic duty SRE. 
[16:36:34] !log dancy@deploy1002 Started deploy [gerrit/gerrit@244120b]: Gerrit to 3.3.5 on gerrit2001 [16:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:44] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@244120b]: Gerrit to 3.3.5 on gerrit2001 (duration: 00m 10s) [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:39] RECOVERY - MegaRAID on thanos-be1003 is OK: OK: optimal, 13 logical, 13 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:40:43] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) @Cmjohnson I can do that now, let me know if that works. If not, just let me know when it would work for you and I will get the server offline for you. [16:40:49] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) @Cmjohnson I can do that now, let me know if that works. If not, just let me know when it would work for you and I will get the server offline for you. 
[16:43:40] !log upgraded spicerack to 0.0.57-1+deb10u1 on cumin1001 [16:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:32] !log Stopping Gerrit for upgrade [16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:52] !log Start server side upload for 1 video file (T287957) [16:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:00] T287957: Server side upload for BChoo (WMF) - https://phabricator.wikimedia.org/T287957 [16:47:43] !log dancy@deploy1002 Started deploy [gerrit/gerrit@244120b]: Gerrit to 3.3.5 on gerrit1001 [16:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:51] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@244120b]: Gerrit to 3.3.5 on gerrit1001 (duration: 00m 07s) [16:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:29] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) These disks are not hot swappable, it appears that they're software raid 1, the disk was swapped but will need to be manually added back to the raid configuration. [16:49:40] 10ops-eqiad, 10DBA: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (10Cmjohnson) @marostegui the disk has been swapped but it appears to have been removed. You will need to add back to the raid configuration. Resolve this task after you restore the raid config. 
[16:49:59] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Cmjohnson) @marostegui yes please [16:50:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Cmjohnson) @marostegui yes please [16:51:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gerrit,gerrit-metrics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:52:13] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) @Cmjohnson I just realised that this host is unreachable, so you can proceed with it anytime and power it back on when you are done. Thanks [16:52:33] PROBLEM - Host backup1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:53] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) @Cmjohnson host off - you can proceed as needed [16:53:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:54:17] (03PS1) 10Hashar: gerrit: align replication plugin settings [puppet] - 10https://gerrit.wikimedia.org/r/709767 [16:54:40] PROBLEM - Host backup1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:55:54] (03CR) 10Brennen Bearnes: [C: 03+1] gerrit: align replication plugin settings [puppet] - 10https://gerrit.wikimedia.org/r/709767 (owner: 10Hashar) [16:57:07] RECOVERY - Host backup1006 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:59:11] !log Gerrit has been upgraded [16:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] chrisalbon and accraze: #bothumor I � Unicode. 
All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T1700). [17:00:25] RECOVERY - Host backup1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [17:04:32] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Cmjohnson) 05Open→03Resolved DIMM A3 was replaced and the log was cleared. [17:12:08] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) Memory looks good now. This host needs to be recloned - I will do that tomorrow Thanks Chris [17:13:33] (03CR) 10Hashar: "We have spotted a change in replication.config due to a new setting. gerrit init insist on writing the default value to the config file :]" [puppet] - 10https://gerrit.wikimedia.org/r/709767 (owner: 10Hashar) [17:16:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Cmjohnson) 05Open→03Resolved @marostegui the DIMM was replaced, logged cleared and powered on. This should resolve your issue [17:17:23] now that gerrit is upgrade, how do I see what needs my attention? [17:17:58] (03PS1) 10Ebernhardson: This reverts commit 76a2ff29ae84c32620bc794e2c58974f6daab4d1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709770 (https://phabricator.wikimedia.org/T277225) [17:18:12] * thcipriani reads https://gerrit.wikimedia.org/r/Documentation/user-attention-set.html [17:18:19] I assume that the default dashboard will effectively show that info (among your own changes) [17:18:39] yeah, there's a "Your Turn" section at the top of the default dashboard. 
[17:18:57] yeah, very neat: https://gerrit.wikimedia.org/r/Documentation/user-attention-set.html#_dashboard [17:19:11] I don't think you have any attention set yet [17:19:11] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: align replication plugin settings [puppet] - 10https://gerrit.wikimedia.org/r/709767 (owner: 10Hashar) [17:19:20] ^ [17:19:22] (03PS2) 10Ebernhardson: Re-enable commonswiki sister search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709770 (https://phabricator.wikimedia.org/T277225) [17:19:23] gotta wait for folks to act on change which will suggest attentionees [17:21:25] is there a search keyword? [17:22:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) 05Resolved→03Open @Cmjohnson it seems that the host isn't reachable - could you take a look to see if there's any error preventing it to boot up? Thanks! [17:23:18] ah, there is: attention:'USER' [17:23:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10Cmjohnson) a:05Cmjohnson→03wiki_willy these are the 2.5 disks and unfortunately, I do not have a spare or a decommissioned server... 
[17:24:00] i like this new attention feature [17:25:01] (03PS2) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [17:25:09] yeah, I'll be interested if it has any impact on https://wikimedia.biterg.io/app/kibana#/dashboard/Gerrit-Timing [17:25:48] time will show [17:25:50] (03PS3) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [17:25:52] (03PS1) 10Papaul: Add new ganeti-test nodes to DHCP file, site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/709773 (https://phabricator.wikimedia.org/T286484) [17:28:12] (03CR) 10Papaul: [C: 03+2] Add new ganeti-test nodes to DHCP file, site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/709773 (https://phabricator.wikimedia.org/T286484) (owner: 10Papaul) [17:33:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) The host is now up and the memory is ok - thanks! This host needs recloning - will do it tomorrow and then close the task Thanks for your help Chris [17:33:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ganeti-test2001.codfw.wmnet ` The log can be... 
[17:34:09] (03PS4) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [17:35:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10Papaul) [17:39:25] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=mysql file=device_smart.prom instance=db1170 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:46:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10wiki_willy) a:05wiki_willy→03RKemper Hi @RKemper - since elastic1039 is currently at the 5yr mark, and we're currently installing... [17:49:43] (03PS1) 10Dduvall: testwikis wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709778 [17:49:45] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709778 (owner: 10Dduvall) [17:52:28] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709778 (owner: 10Dduvall) [17:52:33] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.17 [17:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:54] (03CR) 10Ottomata: analytics: commission new webserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [17:59:19] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: Not alerting due to fresh production wikiversions: 976 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [17:59:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti-test2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti-test2001.codfw.wmnet'... [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T1800) [18:05:43] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@f0f70d1]: T286642 fixes to bulk daemon prioritization [18:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:51] T286642: mjolnir-bulk-update should consume prioritized topic quicker than normal ones - https://phabricator.wikimedia.org/T286642 [18:06:31] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@f0f70d1]: T286642 fixes to bulk daemon prioritization (duration: 00m 48s) [18:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:17] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.17 (duration: 36m 44s) [18:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:38] hmm, did mwdebug-deploy deploy it faster to k8s than the actual scap took? [18:31:01] I guess much less servers to sync to [18:35:14] legoktm: out of curiosity, does the sync log for k8s mean "sync started" or "finished"? 
[18:35:30] started [18:36:00] i see [18:44:12] !log otto@deploy1002 Started deploy [analytics/refinery@aceb561]: Regular analytics weekly train [analytics/refinery@aceb561] [18:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:10] !log running mwscript migrateUserGroup.php --wiki=idwiki editor reviewer (T286853) [18:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:17] T286853: Reassign idwiki Editors to be Reviewers - https://phabricator.wikimedia.org/T286853 [18:48:13] (03PS5) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [18:54:07] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:54:13] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:54:55] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:56:11] 10SRE, 10Traffic, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Legoktm) [19:00:04] dduvall and twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train - American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T1900). 
[19:00:38] !log otto@deploy1002 Finished deploy [analytics/refinery@aceb561]: Regular analytics weekly train [analytics/refinery@aceb561] (duration: 16m 25s) [19:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:50] !log otto@deploy1002 Started deploy [analytics/refinery@aceb561] (thin): Regular analytics weekly train THIN [analytics/refinery@aceb561] [19:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:57] !log otto@deploy1002 Finished deploy [analytics/refinery@aceb561] (thin): Regular analytics weekly train THIN [analytics/refinery@aceb561] (duration: 00m 07s) [19:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:35] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:03:41] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:04:25] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:05:33] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:06:18] (03PS1) 10Ottomata: refine - bump refinery version to pick up normalized_host transform [puppet] - 10https://gerrit.wikimedia.org/r/709810 (https://phabricator.wikimedia.org/T251320) [19:06:43] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:11:30] (03PS1) 10Dduvall: group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709811 [19:11:32] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709811 (owner: 10Dduvall) [19:12:45] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709811 (owner: 10Dduvall) [19:14:07] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.17 [19:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:23] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:25] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: Not alerting due to fresh production wikiversions: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:19:27] (03CR) 10Ottomata: [C: 03+2] refine - bump refinery version to pick up normalized_host transform [puppet] - 10https://gerrit.wikimedia.org/r/709810 (https://phabricator.wikimedia.org/T251320) (owner: 10Ottomata) [19:21:25] odd. 
i'm seeing a slew of "PHP Warning: Class __PHP_Incomplete_Class has no unserializer" errors following wmf.17 group0 promotion but they're all occurring with wmf.16 [19:22:49] well, time to rollback [19:23:21] I think that means it's trying to unserialize a class that no longer exists or was renamed [19:23:37] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Jclark-ctr) @Marostegui can we schedule downtime tomorrow for me to look at this around 4pm est ? [19:23:45] (03PS5) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [19:24:01] legoktm: any idea why it would occur in wmf.16 when wmf.17 is promoted? [19:24:22] (03CR) 10Ottomata: [C: 03+1] analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [19:24:24] maybe this is a stupid idea...group0 wiki serializes something that is then deserialized by a group1/group2 wiki? [19:24:37] ^^ likely that [19:24:40] doesn't seem stupid to me [19:24:43] (03PS6) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [19:24:55] that's what I'd imagine as well... but can't say without a stack trace [19:25:10] k. 
i'll rollback for now and gather more info for a bug report [19:25:25] majavah: here you go https://www.irccloud.com/pastebin/x70DJlKm/ [19:25:35] !log otto@deploy1002 Started deploy [analytics/refinery@aceb561] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@aceb561] [19:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:50] (03CR) 10Ryan Kemper: [C: 03+2] analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [19:26:05] nothing has changed in Gadgets recently... [19:27:06] or also this happens https://www.irccloud.com/pastebin/bC2vdeSG/ [19:27:32] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: revert group0 wikis to 1.37.0-wmf.16 [19:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:49] ...i opened several stacktraces, and they're all different [19:28:14] so more likely something lower [19:28:23] (03PS1) 10Dduvall: Revert "group0 wikis to 1.37.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709813 [19:28:25] looks so [19:28:25] (03CR) 10Dduvall: [C: 03+2] Revert "group0 wikis to 1.37.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709813 (owner: 10Dduvall) [19:28:42] https://www.mediawiki.org/wiki/MediaWiki_1.37/wmf.17 doesn't have relnotes yet :( [19:28:55] i also see some instances of the error for mediawikiwiki, which is group0 IIRC [19:29:08] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ganeti-test2002.codfw.wmnet ` The log can be found in `/var/log/wm... [19:29:10] ouch.. I'm on a new laptop without a core clone handy... 
:/ [19:29:11] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.37.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709813 (owner: 10Dduvall) [19:29:35] majavah: clone it now! It will always come handy later :D [19:30:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:12] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Marostegui) That's around 10pm for me (and for @kormat) and 9pm for @lsobanski, any chances this can be done a bit earlier? Another option would be to leave the host down for you to work it out and power it back... [19:30:19] I'm guessing its https://gerrit.wikimedia.org/r/c/mediawiki/core/+/691272 [19:30:27] yeah, let's see how long this takes with my mobile connection [19:30:53] the errors stopped appearing [19:31:15] !log otto@deploy1002 Finished deploy [analytics/refinery@aceb561] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@aceb561] (duration: 05m 40s) [19:31:16] !log T285355 `ryankemper@an-web1001:~$ sudo run-puppet-agent` to establish `role(analytics_cluster::webserver)` on the host in preparation for upcoming cutover from `thorium`->`an-web1001` [19:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:28] T285355: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 [19:31:35] or...at least appear less frequently (just one at Aug 3, 2021 @ 19:30:23.654) [19:32:23] testwikis are still at wmf.17 [19:32:38] right [19:33:31] legoktm: does the patch you linked revert cleanly? 
it seems to have a bunch of merged changes on top of it [19:34:52] I haven't checked yet, I don't actually see what's wrong with it, but it looks like the only class rename I see, plus it was intentionally serializable [19:35:23] majavah: cleanly enough. RELEASE-NOTES-1.37 is the only conflict [19:35:35] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Jclark-ctr) I would prefer leave host down for me to work it out and power it back when finished [19:36:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:37] we can try to revert it, or...add class_alias( MySQLMasterPos::class, 'Wikimedia\\Rdbms\\MySQLPrimaryPos' ); to wmf.16 [19:36:51] oh, duh I forgot the problem was across branches [19:36:57] it's definitely that patch that's at fault [19:37:14] legoktm: should i revert, or are you on it? [19:37:57] uh, why don't you [19:38:09] though I think your idea of putting in class_alias to wmf.16 also sounds good to me [19:38:10] the patch does match, both stack traces urbanecm linked reference a line in LoadBalancer which gets a DBPrimaryPos object from cache [19:39:28] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709815 uploaded for the revert [19:39:37] wmf.17 has an alias for PrimaryPos -> MasterPos, but "i also see some instances of the error for mediawikiwiki, which is group0 IIRC", so not sure if that would fix it [19:40:14] there were just three of them that i saw -- might be a hiccup related to deployment (i don't see immediately how that would happen, but i wouldn't be that surprised) [19:40:41] filed as https://phabricator.wikimedia.org/T287988 fyi [19:42:12] re: reverting vs. a class alias, could it be possible that both the old and new class names are now present in the object caches?
[19:42:30] !log otto@deploy1002 Started deploy [analytics/refinery@ea78871]: Regular analytics weekly train [analytics/refinery@ea78871] [19:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:01] dduvall: possible, depends on the expiry. But I'd expect the errors to not stop if that was the case. [19:43:12] ah, good point [19:43:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: REIMAGE [19:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:28] +2'd the revert [19:44:36] thanks [19:44:40] (03CR) 10Urbanecm: "This change is ready for review." [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:44:52] and i uploaded ^^ for the class alias, if we decide to go that route [19:45:14] err, we just need one right? either the revert or class_alias? [19:45:48] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: REIMAGE [19:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:09] legoktm: yes, unless whoever maintains rdbms decides to re-revert and apply the alias [19:46:19] *unrevert [19:46:30] well I'd rather just do the class alias route then [19:47:08] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709789 looks empty though? [19:47:18] (03PS2) 10Urbanecm: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) [19:47:21] forgot to press "publish" [19:47:23] try now legoktm [19:47:57] do you also need to alias DBMasterPos? [19:48:03] the revert fails phan, "undeclared type \Wikimedia\Rdbms\DBPrimaryPos" [19:48:09] I think not...
[19:48:23] https://integration.wikimedia.org/ci/job/mediawiki-core-php72-phan-docker/52082/console [19:48:42] (03CR) 10Krinkle: "autoloader as well" [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:49:09] right [19:49:29] (03PS3) 10Legoktm: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:49:31] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:51] (03PS4) 10Urbanecm: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) [19:50:59] (03CR) 10Urbanecm: Add MySQLPrimaryPos as an alias to MySQLMasterPos (031 comment) [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:51:32] (03CR) 10Legoktm: [C: 03+1] Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:51:46] * legoktm looks at why mwdebug deploy failed [19:52:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:39] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:51] "Error: timed out waiting for the condition" [19:53:31] (03Abandoned) 10Legoktm: arclamp: add svgs for some key entrypoint/singleton methods calls [puppet] - 10https://gerrit.wikimedia.org/r/598292 (https://phabricator.wikimedia.org/T253679) (owner: 10Aaron Schulz) [19:54:29] (03PS5) 10Zabe: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [19:54:44] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti-test2002.codfw.wmnet'] ` and were **ALL** successful. 
[19:54:49] urbanecm: composer test was failing due to the wrong comment style [19:54:53] thanks zabe [19:56:15] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:57:01] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb={DELETE,LIST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:57:17] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={create,delete,get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:57:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:01] (03PS1) 10Ryan Kemper: analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 [19:59:20] (03PS2) 10Ryan Kemper: analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 [19:59:53] (03CR) 10jerkins-bot: [V: 04-1] analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 (owner: 10Ryan Kemper) [20:01:22] dduvall: should we try again with the wmf.16 hack? [20:01:51] urbanecm: sounds good to me, yes [20:02:00] urbanecm: ci is failing [20:02:17] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:02:36] once they are merged and sync'd i can re-promote group0 [20:03:08] !log otto@deploy1002 Finished deploy [analytics/refinery@ea78871]: Regular analytics weekly train [analytics/refinery@ea78871] (duration: 20m 38s) [20:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:57] i'm not sure what's going on with debug servers atm though. it would be good to get verification that the class aliases solve the issue before rolling to group0 again [20:04:03] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [20:04:16] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [20:04:22] dduvall: you mean the mwdbug-deploy? that's mw on k8s related AFAIK [20:05:09] yeah. i saw trouble with it earlier. legoktm: is the k8s mwdebug deploy ok again? [20:05:20] (03PS6) 10Urbanecm: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) [20:05:23] yeah, or at least it's fine to ignore [20:05:31] k [20:05:40] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ganeti-test2003.codfw.wmnet ` The log can be found in `/var/log/wm... 
[20:05:52] (03PS7) 10Zabe: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [20:05:55] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:06:12] lol sorry [20:06:14] zabe: why the PS7? [20:07:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10RKemper) jclark found a spare and replaced the drive. will get this host re-imaged and back into service [20:07:56] !log otto@deploy1002 Started deploy [analytics/refinery@ea78871] (thin): Regular analytics weekly train THIN [analytics/refinery@ea78871] [20:08:01] (03PS8) 10Urbanecm: Add MySQLPrimaryPos as an alias to MySQLMasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) [20:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:03] !log otto@deploy1002 Finished deploy [analytics/refinery@ea78871] (thin): Regular analytics weekly train THIN [analytics/refinery@ea78871] (duration: 00m 07s) [20:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:18] !log otto@deploy1002 Started deploy [analytics/refinery@ea78871] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea78871] [20:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:57] (03PS9) 10Urbanecm: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) [20:09:36] i aliased the interface in PS8 as well (the breaking patch does it as well) [20:11:19] (03CR) 10Legoktm: [C: 03+1] Add (MySQL/DB)PrimaryPos 
as an alias to (MySQL/DB)MasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [20:11:27] lgtm for you to sync out [20:11:45] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:50] thanks legoktm [20:12:09] and if deploy_to_mwdebug fails, feel free to ignore it, I'll poke at it after lunch [20:12:16] (03CR) 10Urbanecm: [C: 03+2] "approved by lego" [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [20:12:21] ack [20:13:27] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:55] !log otto@deploy1002 Finished deploy [analytics/refinery@ea78871] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea78871] (duration: 05m 36s) [20:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:27] (03PS3) 10Ryan Kemper: analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 [20:18:56] (03CR) 10jerkins-bot: [V: 04-1] analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 (owner: 10Ryan Kemper) [20:19:27] (03PS4) 10Ryan Kemper: analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 [20:20:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: REIMAGE [20:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: REIMAGE [20:23:21] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] PROBLEM - Host cloudvirt1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:28:23] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aarora-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:43] (03CR) 10Ottomata: [C: 03+1] analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 (owner: 10Ryan Kemper) [20:28:45] RECOVERY - Host cloudvirt1038.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 1.95 ms [20:28:53] urbanecm: this might be a dumb question, but why don't we need a backport of the class alias for wmf.17? [20:28:59] (03CR) 10Ryan Kemper: [C: 03+2] analytics web: create wikistats webdir if absent [puppet] - 10https://gerrit.wikimedia.org/r/709817 (owner: 10Ryan Kemper) [20:29:06] dduvall: because it is already aliased there [20:29:20] (03PS1) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [20:29:22] oh! ok [20:29:23] the breaking patch introduced aliases for backwards compat [20:29:29] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:32] got it, ok. makes sense [20:29:46] but not for forward compat (ie. 
in the old code, wmf.16) [20:30:34] (03CR) 10jerkins-bot: [V: 04-1] Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [20:30:38] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti-test2003.codfw.wmnet'] ` and were **ALL** successful. [20:30:59] (03Merged) 10jenkins-bot: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [20:31:48] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10Papaul) [20:32:17] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff ready for you [20:33:18] yay, it merged [20:34:58] pulled to mwdebug2001, browsing the wikis works [20:36:58] (03PS1) 10Ryan Kemper: analytics web: create htdocs subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/709822 [20:37:31] nothing overly suspicious shows in logstash, syncing [20:38:18] excellent. 
yeah, i don't see anything in the last 5 minutes or so [20:39:05] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/autoload.php: 2d4ea752ec6f412ba071ef46023c978d55afcd98: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 1/3) (duration: 01m 08s) [20:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:17] T287988: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T287988 [20:40:16] shoot i see a new fatal [20:40:28] that might be me [20:40:33] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/includes/libs/rdbms/database/position/DBMasterPos.php: 2d4ea752ec6f412ba071ef46023c978d55afcd98: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 2/3) (duration: 01m 07s) [20:40:37] "Cannot declare class Wikimedia\Rdbms\MySQLMasterPos, because the name is already in use" [20:40:39] [597ef36d-053b-4f7a-ab11-9c36b49e60da] /w/api.php PHP Fatal Error from line 19 of /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/database/position/MySQLMasterPos.php: Cannot declare class Wikimedia\Rdbms\MySQLMasterPos, because the name is already in use [20:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:41] very likely [20:40:42] reverting [20:40:45] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:41:28] ^ mw2383 is depooled, it's T286463 [20:41:29] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [20:41:38] good to know [20:41:42] that's probably a downtime expiring, I'll reup it [20:41:48] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/includes/libs/rdbms/database/position/DBMasterPos.php: REVERT: 2d4ea752ec6f412ba071ef46023c978d55afcd98: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 2/3) (duration: 00m 37s) [20:41:54] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:39] why did it fail? was the class_alias statement executed when autoloading both versions? [20:44:21] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/autoload.php: REVERT: 2d4ea752ec6f412ba071ef46023c978d55afcd98: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 1/3) (duration: 00m 37s) [20:44:22] majavah: not sure, but because it didn't stop after the first revert, it's perhaps relevant to autoloader? [20:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:28] T287988: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T287988 [20:45:54] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime for 13 days, 0:00:00 on mw2383.codfw.wmnet with reason: T286463 [20:45:55] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 13 days, 0:00:00 on mw2383.codfw.wmnet with reason: T286463 [20:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:07] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) p:05Medium→03Low [20:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:06] urbanecm: perhaps it needs an `if (class_exists(...` guard since autoloader might try to require the include twice? [20:47:27] maybe? [20:47:31] my php knowledge is very old, however... [20:47:39] legoktm: majavah: Krinkle: any ideas? [20:48:26] Ummmm [20:48:44] The autoloader is require_once [20:48:58] but same path is there twice [20:49:08] (not sure how much that means) [20:49:16] same path can be there multiple times [20:49:25] Is there a full stacktrace? 
[20:49:31] should be fine, since autoloader shouldn't kick in if the thing in question is already defined [20:49:41] autoloader loads only if the thing asked for isn't defined yet [20:50:02] legoktm: yes, but it's not very helpful https://www.irccloud.com/pastebin/noZLiDie/ [20:50:27] lol [20:51:41] I'm at a loss on why this wouldn't work [20:51:46] * urbanecm too [20:52:12] and also i'm wondering why it went past the canaries [20:52:14] does `class_alias` itself somehow trigger the autoloader? [20:52:21] I don't understand the deployed patch [20:52:27] It adds new names as aliases for the old classes [20:52:32] Did a revert already land? [20:52:38] Krinkle: it never landed [20:52:39] no, we didn't revert [20:52:47] legoktm: decided to try this approach instead [20:53:05] a revert without anything else fails phan [20:53:06] the idea is that when MW sees a wmf.17 cache entry using the new class name, it maps it back to the wmf.16 class name and unserializes it properly [20:53:07] 10SRE, 10Infrastructure-Foundations, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10Papaul) [20:53:29] wait, so we're no longer fixing the original train blocker but fixing the fallout from caching new-style entries briefly? [20:53:37] But that doesn't explain the new classes not existing in source [20:53:49] maybe I misunderstood the original problem, but I thought the issue was we renamed classes and didn't leave an alias.
[20:53:57] Krinkle: the new classes don't exist in wmf.16 [20:54:02] the problem is that the cache entries are shared across both branches [20:54:09] the errors started to appear when wmf.17 was deployed, but in wmf.16 wikis [20:54:15] probably because of something that's shared across wikis [20:54:15] and when the class was renamed, it left aliases behind, but the new names weren't available on wmf.16 [20:54:22] ack, so the problem was only with wmf.16 seeing new-style entries [20:54:23] ack [20:54:32] yeah, we should have backported that, which is what you did. [20:54:33] cool [20:55:27] dduvall: whenever the autoloader gets a new instantiation of MySQLMasterPos or MySQLPrimaryPos it will load the PHP file, define the class, and the class_alias. At that point any future uses of either class name don't hit the autoloader since both are defined [20:55:38] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10Papaul) [20:55:42] 709789: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709789 is reverted at deploy1002 only now, because it caused `PHP Fatal Error: Cannot declare class Wikimedia\Rdbms\MySQLMasterPos, because the name is already in use` to appear widely [20:55:45] dduvall: not by default. if aliases were defined in a different file, then passing autoload=true would mean you trigger the destination class loading at the alias call. In this case though, autoloader won't trigger either way since the alias is below the class def, so it's resolved and defined either way by then. [20:56:04] and i don't understand _why_ that happens [20:56:52] Did you catch it during mwdebug validation?
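The autoloader behaviour described above (one load defines both the class and its alias, so neither name reaches the autoloader again) can be sketched in isolation. This is only a minimal sketch under stated assumptions, not the MediaWiki autoloader: `OldPos`, `NewPos`, `defineOldPos()` and `$loads` are hypothetical stand-ins, and a plain function call stands in for requiring the class file.

```php
<?php
// Hypothetical stand-in for the class file: declares the class, then the alias.
// Calling it a second time would redeclare OldPos, which is a fatal error.
function defineOldPos(): void {
    class OldPos {
    }
    // Alias sits below the class definition, so OldPos already exists here.
    class_alias( OldPos::class, 'NewPos' );
}

// Counts how often the autoloader is actually asked to load something.
$loads = 0;

spl_autoload_register( static function ( string $name ) use ( &$loads ) {
    // Both names map to the same "file", like two entries in an autoload map.
    if ( $name === 'OldPos' || $name === 'NewPos' ) {
        $loads++;
        defineOldPos();
    }
} );

// First reference (by the new name) goes through the autoloader once.
var_dump( class_exists( 'NewPos' ) );
// The old name is already defined, so the autoloader is not consulted again.
var_dump( class_exists( 'OldPos' ) );
echo $loads, "\n";
```

In this sketch the class body is only ever executed once, which matches the expectation voiced in the channel; it does not explain the fatal seen in production.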
[20:57:17] Krinkle: no, browsing wikis worked fine at mwdebug and this error was not in logstash for mwdebug [20:57:21] it also went past the canaries [20:57:28] (it=deployment) [20:58:08] i only noticed it when watching mediawiki-new-errors in logstash during the deployment [21:00:51] at this point I'm leaning towards that we just revert it from core for now to unblock the train [21:01:16] legoktm: then we have to figure out why the revert fails CI :/ [21:01:17] hmm, ok. i guess what i was wondering was whether some path like this were possible: "autoloader handles reference to DBPrimaryPos -> does require_once DBMasterPos.php -> class is defined -> class_alias is called referencing DBPrimaryPos (include isn't done yet so maybe not marked as included by autoloader?) -> autoloader handles reference to DBPrimaryPos -> does require_once DBMasterPos -> class is defined but [21:01:18] already exists [21:01:21] I also have to pop into a meeting now [21:02:21] https://3v4l.org/dCEuN [21:02:33] It seems serialize is fine with data in either direction, fwiw [21:03:14] dduvall: no, autoloader is based on class existing first, not based on file loading. loading files is user-land logic in response to autoloader being called. [21:03:49] and in general we've done these exact renames many times before so it's odd that it fails this way. [21:04:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:31] got it. ok. i'll stop adding noise then and let y'all handle it :) [21:05:00] I'll try to reproduce the issue from eval.php [21:05:20] can someone confirm that everything is fine in prod right now? we're off-the-heat now yeah? [21:05:35] Krinkle: afaik, yes, patch is reverted, wmf.17 rollbacked [21:05:43] urbanecm: and gerrit/prod match? 
[21:05:51] for both branches [21:05:57] not yet, revert of the backport is only at deployment [21:06:05] I'm pushing the revert to gerrit now [21:06:22] (03PS1) 10Urbanecm: Revert "Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos" [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709790 (https://phabricator.wikimedia.org/T287988) [21:06:41] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@2d533ba]: enable glent version marker in final output [21:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:53] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "revert" [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709790 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [21:07:08] > return class_exists('Wikimedia\\Rdbms\\MySQLPrimaryPos'); [21:07:08] bool(false) [21:07:08] > return class_exists('Wikimedia\\Rdbms\\MySQLMasterPos'); [21:07:08] bool(true) [21:07:08] > [21:07:19] krinkle at mwmaint1002.eqiad.wmnet in ~$ mwscript eval.php --wiki enwiki [21:07:31] so. [21:07:40] we can rule out serialization as a possible related cause [21:07:53] why is that? [21:08:27] Krinkle: i merged the revert of the backport and fetched it at deployment host, prod/gerrit should match now [21:08:40] I'm not sure why it isn't found yet [21:08:47] looking for typos now [21:09:24] why _what_ isn't found? [21:10:13] return class_exists('Wikimedia\\Rdbms\\MySQLPrimaryPos'); [21:10:13] 22:07 bool(false) [21:10:16] This is deterministic [21:10:25] the class doesn't exist on enwiki/wmf.16 with plain class_exists calls [21:10:38] so there is likely an obvious bug somewhere [21:10:43] i don't think it is supposed to, unless you manually applied the backport [21:10:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:15] ok, let's zoom out. 
[21:12:06] mwcore master defines MySQLPrimaryPos with alias from old name MySQLMasterPos [21:12:19] mwcore wmf.17 is on testwikis only, and has the same [21:12:53] mwcore wmf.16 is on all groups except testwikis, and defines MySQLMasterPos only as a normal class [21:13:17] why is this not causing problems? [21:14:02] you mean why the warning from T287988 doesn't still happen? [21:14:02] T287988: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T287988 [21:14:17] indeed [21:14:38] Are we just lucky that testwikis are low traffic and not poisoning caches for s3 wikis? [21:14:52] that's my bet [21:14:55] OK [21:14:57] Let's assume that [21:15:12] next, when you deployed the addition of new-name aliases to wmf.16 (old branch), it caused more issues. [21:15:28] same warning, same stack trace? [21:15:43] the warning has multiple stacktraces [21:15:49] i am seeing instances of the error where we stand now [21:16:01] and i just noticed what dduvall said [21:16:05] (with wmf.17 on testwikis) but very few [21:16:08] ok [21:16:09] well, way fewer [21:16:35] scrap that (what I said), it's not the same warning/trace. When we deployed the "fix", the new warnings were not about unserializer but about class defining, right.
[21:16:36] > Cannot declare class Wikimedia\Rdbms\MySQLMasterPos, because the name is already in use in MySQLMasterPos.php [21:16:41] > [{reqId}] {exception_url} PHP Fatal Error from line 19 of /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/database/position/MySQLMasterPos.php: Cannot declare class Wikimedia\Rdbms\MySQLMasterPos, because the name is already in use [21:16:46] right [21:17:02] yes [21:17:05] and its stacktrace is: #4 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/objectcache/APCUBagOStuff.php(68): apcu_fetch() [21:17:44] sometimes [21:17:51] there's also a shorter version of it https://www.irccloud.com/pastebin/c0FqzeMK/ [21:17:54] (the one i quoted earlier) [21:18:35] ok, can you spot a difference between the two versions, e.g. why is it sometimes an unrecoverable fatal? [21:19:00] i don't know why it happens only sometimes or why it happens at all [21:19:06] (it=double definition fatal) [21:19:07] ok [21:19:23] I'm taking a scap lock, applying the patch and pulling it to mwmaint [21:19:33] ok [21:20:49] $ mwscript eval.php --wiki enwiki [21:20:49] > return class_exists('Wikimedia\\Rdbms\\MySQLPrimaryPos'); bool(true) [21:20:49] > return class_exists('Wikimedia\\Rdbms\\MySQLMasterPos'); bool(true) [21:20:57] so in general it works. [21:21:03] with no fatals/warnings? [21:21:06] indeed [21:21:28] tried in the reverse order in a new cli process as well [21:21:36] trying some serialized forms now [21:23:12] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [21:25:18] > $str = 'C:31:"Wikimedia\Rdbms\MySQLPrimaryPos":124:{a:5:{s:8:"position";s:5:"0.0/0";s:12:"activeDomain";N;s:14:"activeServerId";N;s:16:"activeServerUUID";N;s:8:"asOfTime";i:0;}}'; [21:25:19] > return unserialize($str); [21:25:19] object(Wikimedia\Rdbms\MySQLMasterPos)#562 (8) { [21:25:29] that part works fine as well [21:25:33] (again, each time in a new process) [21:25:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:15] * Krinkle opens backpack, pulls out an "opcache was here" sticker. [21:26:22] haha [21:26:27] We have seen this once before. [21:26:37] But unlikely I suppose given multiple servers etc. [21:27:11] Let me try involving apcu as well [21:27:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:07] pulling the patch to mwdebug1002 as well [21:29:20] Note mwdebug1002 is read only -- shouldn't matter, but mentioning just in case [21:29:45] (DB read only i mean) [21:29:56] right, could've done codfw. Anyway, np for now [21:35:53] https://paste.toolforge.org/view/8b92c52c#Q11XbpMZVv1v8oWHWy9oGI6ovCNpGe91 [21:36:05] https://test.wikipedia.org/w/krinkle.php?mode=store-new / https://test2.wikipedia.org/w/krinkle.php?mode=fetch [21:36:30] seems to work fine there as well [21:37:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
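The unserialize experiment above can be reproduced without MediaWiki. The following is a minimal sketch under stated assumptions: `OldPos`, `NewPos` and `Missing` are hypothetical class names, and the payload uses PHP's plain `O:` object format rather than the `C:` (Serializable) format of the real cache entries.

```php
<?php
// Hypothetical stand-in for the class as it exists on the older branch.
class OldPos {
    public $position = '';
}

// Forward-compat alias: the older branch learns the new name.
class_alias( OldPos::class, 'NewPos' );

// Payload as newer code might have cached it, naming the class 'NewPos'.
$payload = 'O:6:"NewPos":1:{s:8:"position";s:5:"0.0/0";}';

// unserialize() resolves the alias and builds an instance of the real class.
$obj = unserialize( $payload );
echo get_class( $obj ), "\n";
echo $obj->position, "\n";

// With no class or alias for the stored name, PHP yields a placeholder object.
$unknown = unserialize( 'O:7:"Missing":0:{}' );
echo get_class( $unknown ), "\n";
```

The last case is the situation behind the `__PHP_Incomplete_Class` warning in T287988: the stored class name cannot be resolved on the branch that reads the entry.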
[21:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:00] changed 0 to mt_rand() to make it easier to verify it is the same value [21:38:02] and indeed it is [21:38:13] so it transparently becomes an instance of the older class etc. [21:38:15] as expected/desired [21:38:29] Maybe it's a dumb idea, but couldn't the issue come from autoload.php being synced and MySQLMasterPos not, which was the case when Martin decided to revert? [21:38:37] so yeah.. that brings us closer to opcache or other non-code issues. [21:39:03] (PS2) Ryan Kemper: analytics web: create htdocs subdirectory [puppet] - https://gerrit.wikimedia.org/r/709822 [21:39:06] I assumed the issue was persistent in prod [21:39:14] but I don't actually know that for sure [21:39:16] urbanecm: ? [21:39:26] it persistently appeared in logs, but i never managed to actually reproduce it in prod [21:39:33] (CR) jerkins-bot: [V: -1] analytics web: create htdocs subdirectory [puppet] - https://gerrit.wikimedia.org/r/709822 (owner: Ryan Kemper) [21:39:52] also canaries/manual mwdebug browsing gave me a green light [21:39:59] urbanecm: ok, so here's an idea - could we sync just the autoloader file safely right now? [21:40:06] it seems like that couldn't make anything worse, right? [21:40:18] it adds an entry we currently don't (or rarely) look for [21:40:26] and then sync the file change a few minutes later, with revert ready [21:40:41] and if we do look for it, it'd fail with a "class not defined" error [21:41:03] however... [21:41:12] one more question though - when we saw all these class already declared errors - that was with wmf.17 only on test wikis same as now? [21:41:17] ...when i was originally reverting, the errors did NOT stop until i also reverted autoload.php [21:41:47] presumably that would've been just the rare number of "serialise warnings" turning into "class already declared" or did the volume also increase?
[21:42:37] also worth noting, in case it wasn't obvious, the current low-noise warnings are degrading to a cache miss, so it's not fatal [21:42:46] so the same number of warnings is not the same severity issue per se. [21:43:03] the class already declared is a fatal though [21:43:06] (was) [21:43:14] exactly, so swapping one for the other is still a problem. [21:43:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:18] i see exactly 1 occurrence of the serialize warning during the minute of 21:40 UTC, and the "class not declared" fatal was appearing about 50/minute [21:45:21] (PS3) Ryan Kemper: analytics web: create htdocs subdirectory [puppet] - https://gerrit.wikimedia.org/r/709822 [21:45:29] so the volume also increased [21:45:41] * legoktm is back, reading scrollback [21:45:55] to answer Krinkle's question above [was consulting logstash to ensure my memory isn't faulty] [21:45:58] ok, that actually could make sense. because the warning means a cache miss, so the non-testwikis instantly correct the value [21:46:08] whereas the fatal means group-0/1/2 are stuck [21:46:49] I'll give legoktm a few minutes before doing the last thing I can think of which is to just do it again, but one file at a time. [21:47:05] you can declare me insane after proving me wrong [21:47:31] it's more likely you're both insane and right [21:47:36] ;-) [21:48:21] syncing autoload.php first seems sensible [21:48:27] i did that originally, ftr [21:48:36] oh [21:48:43] (i used scap sync-file to get it out, but without waiting after syncing each file) [21:49:36] so then what order are you proposing to sync it in now Krinkle?
[21:50:08] i did php-1.37.0-wmf.16/autoload.php, then php-1.37.0-wmf.16/includes/libs/rdbms/database/position/DBMasterPos.php, then i noticed the new fatal, and then i reverted it in the exact opposite order [21:50:26] legoktm: Good question. When I suggested this strategy I had not yet fully realized that the current warning is non-fatal. so adding the entry to the autoloader first would not actually be a no-op. [21:50:54] In fact, adding the alias first seems like the better strategy [21:51:00] syncing the class_aliases first seems like it should work [21:51:04] right [21:51:20] that may very well have been the problem. [21:51:40] either 1) the file has already been loaded and the class alias enabled, so no warnings, it all just works, 2) the file has not been loaded, no class alias, not in autoloader, so warning (status quo) [21:51:41] right [21:51:42] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@2d533ba]: enable glent version marker in final output (duration: 45m 00s) [21:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:58] urbanecm: do you want the honours? [21:52:05] * Krinkle unlocks [21:52:12] Krinkle: sure [21:52:35] the wmf.16 directly is currently dirty with my patch re-cherry-picked [21:52:38] directory* [21:52:45] up to you how you want to apply it [21:52:54] (PS1) Urbanecm: Revert "Revert "Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos"" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709791 (https://phabricator.wikimedia.org/T287988) [21:53:29] going to do it through gerrit to avoid unintentionally pushing the sec patch to gerrit [21:53:30] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting.
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [21:53:43] (CR) Urbanecm: [C: +2] Revert "Revert "Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos"" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709791 (https://phabricator.wikimedia.org/T287988) (owner: Urbanecm) [21:54:41] removed the cherry-pick and waiting for the backport to get merged [21:55:22] Krinkle: just to make sure: the decision is to sync DBMasterPos.php and MySQLMasterPos.php first and _then_ autoload, right? [21:56:56] +1 [22:02:04] urbanecm: yes [22:02:09] thanks [22:05:52] /w/index.php?title=MediaWiki:DoubleWiki.js&action=raw&ctype=text/javascript PHP Warning: Class __PHP_Incomplete_Class has no unserializer [22:05:53] /w/index.php?title=MediaWiki:Common.js/top.js&action=raw&ctype=text/javascript PHP Warning: Class __PHP_Incomplete_Class has no unserializer [22:05:56] on an unrelated note... [22:06:07] ... one can wonder why it is that db positions are exchanged at all on requests like these [22:07:04] not sure [22:07:12] because those requests need to check if page exists? [22:07:30] I vaguely recall that we currently fall back when there is no cookie/queryparam for cpPosIndex to reverse-engineering the key and then fetching it and then applying it anyway.
[22:07:38] well, this is for chronology protector [22:07:54] right [22:08:01] it loads the last-known position for your session from redis and then calls doWait() to make sure each subsequent request is as new or newer than the previous [22:08:18] but it seems like we could potentially skip all that for the majority of GET requests where you haven't done any writes recently [22:09:23] I think that fallback exists for an edge case relating to multi-dc that may or may not be applicable anymore [22:09:32] it certainly hasn't applied to prod at any point so far given we're not multi-dc yet [22:12:32] (Merged) jenkins-bot: Revert "Revert "Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos"" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709791 (https://phabricator.wikimedia.org/T287988) (owner: Urbanecm) [22:12:35] o/ [22:15:52] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/includes/libs/rdbms/database/position/DBMasterPos.php: 7d286dc0feaef354943a70ee18014d55cbb2aefa: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 1/3) (duration: 01m 07s) [22:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:00] T287988: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T287988 [22:17:52] ok, here we go.
watching the logs [22:17:58] * urbanecm too [22:18:20] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/includes/libs/rdbms/database/position/MySQLMasterPos.php: 7d286dc0feaef354943a70ee18014d55cbb2aefa: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 2/3) (duration: 01m 07s) [22:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:40] and now the autoload [22:20:01] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/autoload.php: 7d286dc0feaef354943a70ee18014d55cbb2aefa: Add (MySQL/DB)PrimaryPos as an alias to (MySQL/DB)MasterPos (T287988; 3/3) (duration: 01m 07s) [22:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:07] and let's hope [22:22:36] serialization warning stopped appearing, no new errors that i can see [22:22:43] either it took down logstash, or it fixed all errors for 5 minutes [22:22:57] heheh [22:23:03] * dduvall hopes the latter? [22:23:06] that would be quite an accomplishment :D [22:23:10] * Krinkle disables "Ignore: Timeout/OOM" [22:23:13] ok, those are still there [22:23:19] good, so we're done here [22:23:31] thanks for helping with investigating this Krinkle [22:23:36] thanks so much Krinkle and urbanecm and legoktm [22:23:42] (and legoktm and everyone else who helped) [22:24:00] wheee [22:24:57] i feel fine about rolling group0 if others are [22:25:14] +1 [22:26:28] (PS1) Dduvall: group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/709839 [22:26:30] (CR) Dduvall: [C: +2] group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/709839 (owner: Dduvall) [22:26:36] +1 [22:26:37] here we go :) [22:26:43] let's hope!
[22:27:12] (Merged) jenkins-bot: group0 wikis to 1.37.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/709839 (owner: Dduvall) [22:28:35] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.17 [22:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:18] RECOVERY - Ensure local MW versions match expected deployment on mw2383 is OK: OKAY: Not alerting due to fresh production wikiversions: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [22:33:56] so far so good [22:37:16] !log re-rolled 1.37.0-wmf.17 to group0 following rollback and subsequent fixes for T287988 (T281158) [22:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:24] T281158: 1.37.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T281158 [22:37:25] T287988: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T287988 [22:43:18] SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle) [22:43:59] https://user.fm/files/v2-95fc83756e4d2385ad99e6c6dc881906/opcache_was_here.jpg [22:48:40] Krinkle: it's not opcache caused though -- or am i missing something? [22:49:00] the only thing that went wrong is that i synced autoload first, as far as i understand this [22:49:18] I left a longish comment on the original core change [22:49:24] (unrelated to opcache discussion) [22:50:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:15] urbanecm: I'd say 50-50. Given it persisted it may also have been a sticky caching issue.
[22:57:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210803T2300). [23:00:05] ebernhardson: A patch you scheduled for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:03:47] \o [23:03:54] i can ship, if things are clear? (looks like it) [23:05:34] afaik things are clear [23:07:06] (CR) Ebernhardson: [C: +2] Re-enable commonswiki sister search [mediawiki-config] - https://gerrit.wikimedia.org/r/709770 (https://phabricator.wikimedia.org/T277225) (owner: Ebernhardson) [23:08:27] (Merged) jenkins-bot: Re-enable commonswiki sister search [mediawiki-config] - https://gerrit.wikimedia.org/r/709770 (https://phabricator.wikimedia.org/T277225) (owner: Ebernhardson) [23:12:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:10] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:26] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709770|Re-enable commonswiki sister search (T277225)]] (duration: 01m 07s) [23:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:33] T277225:
Reenable commonswiki sister search - https://phabricator.wikimedia.org/T277225 [23:20:54] all complete [23:28:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log