[00:42:40] SRE: Unable to confirm email address on beta cluster - https://phabricator.wikimedia.org/T285527 (ppelberg)
[02:06:08] @steward
[02:40:23] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:59:17] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:34:44] SRE, DBA, Datacenter-Switchover: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Marostegui) Treating it like parsercache would also be my first approach. I would be comfortable with doing RO on both DCs, then the switchover and then the RW on both DC...
[04:43:55] (PS1) Marostegui: site.pp: Specify which wikis are on s6. [puppet] - https://gerrit.wikimedia.org/r/701469
[04:44:34] (CR) Marostegui: [C: +2] site.pp: Specify which wikis are on s6. [puppet] - https://gerrit.wikimedia.org/r/701469 (owner: Marostegui)
[04:58:08] (PS1) Legoktm: Revert "mysql_legacy.py: Add x2" [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519)
[06:13:21] (CR) Jcrespo: Revert "mysql_legacy.py: Add x2" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:17:52] (CR) Jcrespo: "I answer myself, I think what is intended is to restrict things to db* hosts with:" [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:21:51] (PS1) Legoktm: mysql_legacy: Allow excluding sections from set_core_masters_readonly() [software/spicerack] - https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519)
[06:21:56] (PS1) Legoktm: sre.switchdc.mediawiki: Handle x2 specially [cookbooks] - https://gerrit.wikimedia.org/r/701475 (https://phabricator.wikimedia.org/T285519)
[06:23:31] (CR) Legoktm: Revert "mysql_legacy.py: Add x2" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:27:20] (CR) jerkins-bot: [V: -1] mysql_legacy: Allow excluding sections from set_core_masters_readonly() [software/spicerack] - https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:29:02] SRE, DBA, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Summary from `#wikimedia-databases`: `lang=irc 22:00:46 I am fine either way, I would prefer to set x2 as read-only and then...
[06:40:46] (CR) Jcrespo: Revert "mysql_legacy.py: Add x2" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:43:52] (CR) Jcrespo: Revert "mysql_legacy.py: Add x2" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[06:50:54] (CR) Jcrespo: [C: +1] "In other words, ignore my comments as I was confusing x1 and pc1." (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210625T0700)
[07:08:52] (CR) JMeybohm: add job to weekly rebuild production-images (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: Jelto)
[07:09:40] (CR) Jcrespo: "I don't think this is needed. It is true it is not yet clear x2 will be read-only on the passive dc, as that will depend on how the applic" [cookbooks] - https://gerrit.wikimedia.org/r/701475 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[07:17:04] !log installing openjdk-8-dbg on wdqs1005 to debug blazegraph
[07:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:32] !log depool and restart blazegraph on wdqs1005
[07:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:19] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:34:00] (PS1) Muehlenhoff: Extend access for etadros [puppet] - https://gerrit.wikimedia.org/r/701477
[07:34:07] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:35:08] SRE, observability: mtail testing infrastructure does not report Runtime errors - https://phabricator.wikimedia.org/T285533 (ema)
[07:36:33] (PS2) Muehlenhoff: Extend access for etadros [puppet] - https://gerrit.wikimedia.org/r/701477
[07:37:23] (CR) Legoktm: "> Patch Set 1:" [cookbooks] - https://gerrit.wikimedia.org/r/701475 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[07:40:32] (CR) Muehlenhoff: [C: +2] Extend access for etadros [puppet] - https://gerrit.wikimedia.org/r/701477 (owner: Muehlenhoff)
[07:40:45] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:08] (Abandoned) Muehlenhoff: remove Lars Wirzenius (liw) from groups [puppet] - https://gerrit.wikimedia.org/r/700260 (owner: Lars Wirzenius)
[07:42:57] !log imported Jenkins 2.289.1 to thirdparty/ci for buster-wikimedia T285531
[07:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:03] T285531: Upgrade Jenkins to 2.289.x - https://phabricator.wikimedia.org/T285531
[07:43:07] SRE, observability, good first task: mtail testing infrastructure prints python deprecation warnings - https://phabricator.wikimedia.org/T285534 (ema)
[07:53:35] (PS4) Ema: varnish: add counters for Varnish SLI [puppet] - https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576)
[07:56:32] SRE, DBA, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (jcrespo) Legoktm asked me to copy comments I had given on the patches here- I think Manuel had already spoken my mind already- it is for the applica...
[07:57:24] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1003.eqiad.wmnet
[07:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1003.eqiad.wmnet
[07:58:35] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1002.eqiad.wmnet
[07:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1002.eqiad.wmnet
[08:00:46] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet
[08:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:20] !log reboot an-worker1101 to unblock stuck GPU
[08:01:22] sigh
[08:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:26] win 7
[08:04:02] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[08:04:03] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl1002.eqiad.wmnet
[08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:55] PROBLEM - Host an-worker1101 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:59] this is me --^
[08:05:11] I had to powercycle again manually, stuck while booting
[08:07:12] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet
[08:07:13] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl1001.eqiad.wmnet
[08:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:55] RECOVERY - Host an-worker1101 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[08:12:45] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet
[08:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:03] (PS1) Giuseppe Lavagetto: switchdc: add a few new excluded services.
[cookbooks] - https://gerrit.wikimedia.org/r/701484
[08:37:14] (PS1) JMeybohm: switchdc: helm-charts should be switched [cookbooks] - https://gerrit.wikimedia.org/r/701488
[08:37:54] (CR) Giuseppe Lavagetto: [C: +1] switchdc: helm-charts should be switched [cookbooks] - https://gerrit.wikimedia.org/r/701488 (owner: JMeybohm)
[08:38:17] SRE, Infrastructure-Foundations, SRE-tools, Spicerack, Datacenter-Switchover: switchdc: systemctl disable command failed, because units were already gone - https://phabricator.wikimedia.org/T285524 (cmooney) AFAIK if you run "list-units" with --all it will still list enabled services in a **s...
[08:38:47] (CR) JMeybohm: [C: +1] switchdc: add a few new excluded services. [cookbooks] - https://gerrit.wikimedia.org/r/701484 (owner: Giuseppe Lavagetto)
[08:39:29] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (jbond) @DannyH are you able to approve this access request?
[08:39:39] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (jbond) p:Triage→Medium
[08:40:18] SRE, serviceops, Datacenter-Switchover: Various services hardcode api.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T285518 (Joe) This is not an issue because that codesearch mostly finds stuff that is not really hardcoded in production, where we use envoy to have services talk to each other, a...
[08:40:43] SRE, serviceops, Datacenter-Switchover: Various services hardcode api.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T285518 (Joe) Open→Invalid
[08:42:00] (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: Volans)
[08:47:31] (CR) JMeybohm: [C: +2] switchdc: add a few new excluded services. [cookbooks] - https://gerrit.wikimedia.org/r/701484 (owner: Giuseppe Lavagetto)
[08:47:34] (CR) JMeybohm: [C: +2] switchdc: helm-charts should be switched [cookbooks] - https://gerrit.wikimedia.org/r/701488 (owner: JMeybohm)
[08:48:49] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2003.codfw.wmnet
[08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:37] (Merged) jenkins-bot: switchdc: add a few new excluded services. [cookbooks] - https://gerrit.wikimedia.org/r/701484 (owner: Giuseppe Lavagetto)
[08:50:39] (Merged) jenkins-bot: switchdc: helm-charts should be switched [cookbooks] - https://gerrit.wikimedia.org/r/701488 (owner: JMeybohm)
[08:51:07] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2003.codfw.wmnet
[08:51:07] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2002.codfw.wmnet
[08:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:17] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2002.codfw.wmnet
[08:52:18] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2001.codfw.wmnet
[08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:58] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2001.codfw.wmnet
[08:54:59] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet
[08:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:30] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2002.codfw.wmnet
[09:02:31] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet
[09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:35] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet
[09:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:49] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on cloudcontrol1003.wikimedia.org with reason: Known issue, working on it
[09:13:49] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on cloudcontrol1003.wikimedia.org with reason: Known issue, working on it
[09:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:28] (CR) Muehlenhoff: "While the patch is correct, the question is whether it does the right thing everyone expects? Right now systemctl mask overrules Puppet an" [puppet] - https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: Legoktm)
[09:15:26] (CR) Legoktm: switchdc: add a few new excluded services. (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701484 (owner: Giuseppe Lavagetto)
[09:15:31] (PS1) Vgutierrez: Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535)
[09:15:55] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol[1003-1005].wikimedia.org with reason: openstack issue
[09:15:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol[1003-1005].wikimedia.org with reason: openstack issue
[09:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:03] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: Volans)
[09:21:46] (PS2) Vgutierrez: Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535)
[09:22:16] (CR) Ema: [C: +1] Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535) (owner: Vgutierrez)
[09:36:32] (CR) Muehlenhoff: [C: +1] "Mapping of commits to CVEs looks fine." [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535) (owner: Vgutierrez)
[09:41:18] (CR) jerkins-bot: [V: -1] Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535) (owner: Vgutierrez)
[09:41:30] hmmm
[09:43:14] lintian messing with me :)
[09:43:42] upgrade to volintian today, get your free copy wherever volan.s exists!
[09:47:06] so lintian deprecated the override no-upstream-changelog?
[09:49:37] \o has anyone seen this type of error before for PHP scripts running as cron jobs?
T285538
[09:49:38] T285538: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/MWWikiversions.php:94 - https://phabricator.wikimedia.org/T285538
[09:54:43] (CR) Kormat: [C: +1] "LGTM." [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[09:57:08] vgutierrez: I'd simply remove lintian there, it seems that CI job is pulling in lintian from buster-backports, so it's also constantly moving target
[09:58:01] lintian is useful for uploads to the Debian unstable archive, for anything else it can provide a few hints, but not more
[10:00:21] (CR) Giuseppe Lavagetto: [C: +1] scaffold: The metrics-config is only needed if statsd is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/700319 (owner: JMeybohm)
[10:00:42] SRE, DBA, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Kormat) >>! In T285519#7176764, @Legoktm wrote: > * [[https://gerrit.wikimedia.org/r/701471|spicerack: Revert "mysql_legacy.py: Add x2"]]: this basi...
[10:02:05] (PS1) Jbond: (WIP) sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[10:02:19] (PS1) Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches. [homer/public] - https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592)
[10:02:48] (PS1) Lucas Werkmeister (WMDE): Stop setting Wikibase client changesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701500 (https://phabricator.wikimedia.org/T257260)
[10:02:50] (PS1) Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientChangesDatabase [mediawiki-config] - https://gerrit.wikimedia.org/r/701501 (https://phabricator.wikimedia.org/T257260)
[10:02:52] (PS1) Lucas Werkmeister (WMDE): Stop setting Wikibase repo foreignRepositories [mediawiki-config] - https://gerrit.wikimedia.org/r/701502 (https://phabricator.wikimedia.org/T257260)
[10:02:54] (PS1) Lucas Werkmeister (WMDE): Remove $wmgWikibaseRepoForeignRepositories [mediawiki-config] - https://gerrit.wikimedia.org/r/701503 (https://phabricator.wikimedia.org/T257260)
[10:02:56] (PS1) Lucas Werkmeister (WMDE): Stop setting Wikibase client repoConceptBaseUri [mediawiki-config] - https://gerrit.wikimedia.org/r/701504 (https://phabricator.wikimedia.org/T257260)
[10:02:58] (PS1) Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientRepoConceptBaseUri [mediawiki-config] - https://gerrit.wikimedia.org/r/701505 (https://phabricator.wikimedia.org/T257260)
[10:03:16] ^ probably waiting until Monday before I deploy these ^^
[10:04:43] moritzm: to be fair no-upstream-changelog has been dropped in 2018 :)
[10:04:50] (CR) jerkins-bot: [V: -1] (WIP) sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[10:10:37] (CR) Giuseppe Lavagetto: [C: -1] add job to weekly rebuild production-images (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: Jelto)
[10:11:40] (PS1) David Caro: wmcs-dns-floating-ip-updater: do a more granular retry [puppet] - https://gerrit.wikimedia.org/r/701506 (https://phabricator.wikimedia.org/T285537)
[10:15:59] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) (owner: Jbond)
[10:17:32] (CR) Muehlenhoff: [C: +1] P:logoutd: create wrapper script for calling logout.d scripts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) (owner: Jbond)
[10:19:31] (PS3) Vgutierrez: Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535)
[10:26:23] (CR) Vgutierrez: [C: +1] Enable profile::nginx for acmechief [puppet] - https://gerrit.wikimedia.org/r/698509 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[10:28:03] (CR) Vgutierrez: [C: +1] Switch acmechief-test1001 to nginx-light [puppet] - https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[10:31:18] (PS1) Giuseppe Lavagetto: mediawiki: do not add our CAs artificially [deployment-charts] - https://gerrit.wikimedia.org/r/701511 (https://phabricator.wikimedia.org/T284417)
[10:34:15] (PS1) Muehlenhoff: Don't show Kerberos ticket info in general [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840)
[10:35:34] (CR) jerkins-bot: [V: -1] Don't show Kerberos ticket info in general [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[10:36:08] (PS2) Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701373
[10:44:05] (PS2) Muehlenhoff: Don't show Kerberos ticket info in general [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840)
[10:45:32] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[10:46:03] (CR) Vgutierrez: [C: +2] Release 8.0.8-1wm4 [debs/trafficserver] - https://gerrit.wikimedia.org/r/701493 (https://phabricator.wikimedia.org/T285535) (owner: Vgutierrez)
[10:48:47] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (IN) >>! In T285160#7164676, @Aklapper wrote: > @IN: Please summarize outcomes when adding comments, and please...
[10:51:17] (PS1) David Caro: wmcs.labs-ip-alias-dump: add a retry [puppet] - https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537)
[10:51:50] (CR) jerkins-bot: [V: -1] wmcs.labs-ip-alias-dump: add a retry [puppet] - https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537) (owner: David Caro)
[10:53:08] (CR) MSantos: [C: +2] Add blubber variant for tile pregeneration image (1 comment) [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701373 (owner: Jgiannelos)
[10:54:09] (Merged) jenkins-bot: Add blubber variant for tile pregeneration image [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701373 (owner: Jgiannelos)
[11:16:32] (PS2) Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches.
[homer/public] - https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592)
[11:28:03] !log installing 4.19.194 kernels on Buster from latest 10.10 point release (no reboots, just rolling out the packages)
[11:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:04] Puppet, Infrastructure-Foundations, User-jbond, cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (jbond)
[11:37:12] Puppet, Beta-Cluster-Infrastructure, Cloud-Services, Infrastructure-Foundations, and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (jbond)
[11:40:04] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (Kanban): Investigate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370 (jbond)
[11:41:52] Puppet, SRE, Infrastructure-Foundations, User-jbond: Add CI check to ensure defaults exist in cloud.yaml - https://phabricator.wikimedia.org/T248994 (jbond)
[11:46:04] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (Kanban): Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (Majavah)
[11:46:17] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (Kanban): Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (Majavah)
[11:46:33] Puppet, Infrastructure-Foundations, cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (jbond)
[11:46:41] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (Kanban): Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (jbond)
[11:54:07] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/MWWikiversions.php:94 - https://phabricator.wikimedia.org/T285538 (Tgr) p:High→Unbreak! It see...
[11:54:43] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[11:57:05] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[11:59:10] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (jbond)
[12:02:26] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:03:57] SRE, Infrastructure-Foundations, Packaging: Debian-glue doesn't check for the validity of the distribution in the changelog. - https://phabricator.wikimedia.org/T252619 (hashar)
[12:14:49] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:16:41] (PS1) Jelto: systemd::timer::job: add parameter to define denendency using after [puppet] - https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431)
[12:16:59] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:19:11] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:19:28] Hi everyone, I'm trying to debug a weird bug in prod, and I'm under the impression that the code in prod is not what we have in git. It seems like a few patches from December weren't applied. Is anyone with deployment access willing to help with that? I can share additional details in a PM, in case this has to do with security stuff
[12:22:24] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:23:04] (PS1) David Caro: cinderutils.ensure: tweak further the fs usage limit [puppet] - https://gerrit.wikimedia.org/r/701528
[12:23:41] (Abandoned) Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] - https://gerrit.wikimedia.org/r/701372 (owner: Jgiannelos)
[12:24:18] (PS1) Jgiannelos: Improve tegola pregeneration image [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701529
[12:25:40] (PS2) Jgiannelos: Improve tegola pregeneration image [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701529
[12:28:03] SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (MoritzMuehlenhoff)
[12:28:33] (CR) Gergő Tisza: "Caused T285538." [puppet] - https://gerrit.wikimedia.org/r/701164 (owner: Legoktm)
[12:28:36] !log installing nmal bugfix update from Buster point release
[12:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:41] !log installing nmap bugfix update from Buster point release
[12:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:48] (PS1) JMeybohm: dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T264209)
[12:30:03] (CR) Jgiannelos: "After some trial and error and it looks like we can remove the insecure flag since we only want to write to /tmp. Also I added an entrypoi" [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/701529 (owner: Jgiannelos)
[12:30:39] (CR) Arturo Borrero Gonzalez: [C: +1] cinderutils.ensure: tweak further the fs usage limit [puppet] - https://gerrit.wikimedia.org/r/701528 (owner: David Caro)
[12:30:50] (CR) Jelto: "Could you please take a look? This change adds an optional after parameter to systemd::timer::job."
[puppet] - https://gerrit.wikimedia.org/r/701525 (https://phabricator.wikimedia.org/T284431) (owner: Jelto)
[12:31:34] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:32:01] (CR) David Caro: [C: +2] cinderutils.ensure: tweak further the fs usage limit [puppet] - https://gerrit.wikimedia.org/r/701528 (owner: David Caro)
[12:35:00] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:35:06] (PS1) Giuseppe Lavagetto: Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415
[12:35:23] (CR) jerkins-bot: [V: -1] Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415 (owner: Giuseppe Lavagetto)
[12:37:42] (CR) Gergő Tisza: mediawiki: Port mw-cli-wrapper to Python (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701164 (owner: Legoktm)
[12:39:46] (Abandoned) Giuseppe Lavagetto: Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415 (owner: Giuseppe Lavagetto)
[12:40:47] (CR) Jbond: [C: +2] postgresql: don't get replica status if version is unavailable [puppet] - https://gerrit.wikimedia.org/r/701428 (owner: Hnowlan)
[12:46:29] SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (MoritzMuehlenhoff)
[12:47:31] (CR) Gergő Tisza: mediawiki: Port mw-cli-wrapper to Python (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701164 (owner: Legoktm)
[12:47:35] (Restored) Giuseppe Lavagetto: Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415 (owner: Giuseppe Lavagetto)
[12:48:33] (PS2) Giuseppe Lavagetto: Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415
[12:50:05] (CR) Giuseppe Lavagetto: [C: +2] Revert "mediawiki: Port mw-cli-wrapper to Python" [puppet] - https://gerrit.wikimedia.org/r/701415 (owner: Giuseppe Lavagetto)
[12:54:33] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[12:56:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:00:58] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[13:04:45] (PS1) Muehlenhoff: Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/701536
[13:06:42] !log upload trafficserver 8.0.8-1wm4 to apt.wm.o (buster) - T285535
[13:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701536 (owner: Muehlenhoff)
[13:07:21] (CR) jerkins-bot: [V: -1] Support new src: prefix in apt pinning [puppet] - https://gerrit.wikimedia.org/r/701536 (owner: Muehlenhoff)
[13:08:26] (PS1) Muehlenhoff: Remove obsolete Tor Puppet classes [puppet] - https://gerrit.wikimedia.org/r/701537
[13:08:37] !log update ATS to version 8.0.8-1wm4 on cp4026 and cp4032 - T285535
[13:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:10] (CR) Arturo Borrero Gonzalez: [C: +1] Remove obsolete Tor Puppet classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701537 (owner: Muehlenhoff)
[13:12:44] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M...
- https://phabricator.wikimedia.org/T285538 [13:14:31] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 [13:15:31] (03PS1) 10Jbond: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 [13:15:33] (03PS1) 10Jbond: P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 [13:16:23] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [13:17:43] (03CR) 10jerkins-bot: [V: 04-1] P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [13:22:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:54] (03PS2) 10Giuseppe Lavagetto: Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 [13:26:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [13:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:39] (03PS10) 10Elukey: Add the custom_deploy.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [13:26:41] (03PS10) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [13:26:43] (03PS3) 10Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [13:27:09] 10SRE, 10GitLab, 10serviceops, 10vm-requests: 
codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10Jelto) [13:27:16] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 (owner: 10Giuseppe Lavagetto) [13:27:22] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:28:10] (03CR) 10Elukey: "I added a TODO section with Joe's comments, and I have also changed a little the istio config after some tests:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:28:43] (03CR) 10Elukey: "Need to add the possibility to configure https for the istio ingress gateway, will work on it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [13:28:55] (03PS1) 10Hashar: scap config for mediawiki/tools/releases [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) [13:29:37] (03CR) 10RLazarus: [C: 03+1] Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 (owner: 10Giuseppe Lavagetto) [13:29:39] (03CR) 10Hashar: [C: 04-1] "Of course we can't have the deployable repo at /srv/deployment/mediawiki/release AND deploy it to the same directory on the deployment ser" [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [13:31:11] (03PS1) 10Jbond: C:trafficserver: use debian::autostart to prevent auto service start [puppet] - 10https://gerrit.wikimedia.org/r/701545 [13:31:13] (03PS1) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 [13:31:15] (03PS1) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 
10https://gerrit.wikimedia.org/r/701547 [13:32:37] (03PS2) 10Jbond: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 [13:33:03] (03PS3) 10Giuseppe Lavagetto: Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 [13:33:23] (03CR) 10jerkins-bot: [V: 04-1] systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 (owner: 10Jbond) [13:33:43] (03CR) 10Ssingh: [V: 03+1] Remove obsolete Tor Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/701537 (owner: 10Muehlenhoff) [13:34:13] (03PS2) 10Jbond: P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 [13:34:21] (03PS2) 10Jbond: C:trafficserver: use debian::autostart to prevent auto service start [puppet] - 10https://gerrit.wikimedia.org/r/701545 [13:34:30] (03PS2) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 [13:36:02] (03CR) 10jerkins-bot: [V: 04-1] P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [13:36:13] (03PS3) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 [13:36:22] (03PS2) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 10https://gerrit.wikimedia.org/r/701547 [13:37:09] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:45] (03PS3) 10Jbond: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 [13:37:54] (03PS3) 10Jbond: P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 [13:38:04] (03CR) 10jerkins-bot: [V: 04-1] systemd::mask: 
refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 (owner: 10Jbond) [13:38:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert "mediawiki: Port mw-cli-wrapper to Python"" [puppet] - 10https://gerrit.wikimedia.org/r/701416 (owner: 10Giuseppe Lavagetto) [13:40:40] (03PS1) 10David Caro: wmcs-prepare-cinder-volume: use the correct fstype for fstab [puppet] - 10https://gerrit.wikimedia.org/r/701549 [13:42:06] (03CR) 10Andrew Bogott: [C: 03+1] wmcs-prepare-cinder-volume: use the correct fstype for fstab [puppet] - 10https://gerrit.wikimedia.org/r/701549 (owner: 10David Caro) [13:42:10] (03PS3) 10Jbond: C:trafficserver: use debian::autostart to prevent auto service start [puppet] - 10https://gerrit.wikimedia.org/r/701545 [13:42:46] (03PS4) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 [13:42:53] (03PS3) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 10https://gerrit.wikimedia.org/r/701547 [13:43:29] (03PS4) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 10https://gerrit.wikimedia.org/r/701547 [13:44:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10ayounsi) +1 to me. As the end goal is *less* filtering for WMCS hosts, it's a win-win. To me the next steps are: * Improve htt... 
[13:45:58] (03CR) 10David Caro: [C: 03+2] wmcs-prepare-cinder-volume: use the correct fstype for fstab [puppet] - 10https://gerrit.wikimedia.org/r/701549 (owner: 10David Caro) [13:50:35] !log jelto@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab2001.wikimedia.org [13:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:03] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm) [13:59:55] (03PS7) 10Jbond: P:logoutd: create wrapper script for calling logout.d scripts [puppet] - 10https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) [14:00:20] (03CR) 10Jbond: P:logoutd: create wrapper script for calling logout.d scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:04:41] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.4154 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:05:22] ^ is that just from unbreaking T285538 and catching up? will keep an eye on it [14:05:23] T285538: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . 
in /srv/mediawiki/multiversion/MWWikiversions.php:94 - https://phabricator.wikimedia.org/T285538 [14:05:42] (if it is, it should recover on its own shortly) [14:06:10] and yep, looks like it peaked and recovered [14:07:22] (03PS2) 10Hashar: scap config for mediawiki/tools/releases [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) [14:07:37] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [14:08:13] (03PS2) 10Muehlenhoff: Remove obsolete Tor Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/701537 [14:08:46] (03CR) 10Muehlenhoff: Remove obsolete Tor Puppet classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701537 (owner: 10Muehlenhoff) [14:09:49] (03PS5) 10Itamar Givon: Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) [14:10:40] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [14:11:48] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Doing): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10hashar) a:05LarsWirzenius→03None [14:11:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:16:27] (03CR) 10Jbond: [C: 03+2] P:logoutd: create wrapper script for calling logout.d scripts [puppet] - 10https://gerrit.wikimedia.org/r/700922 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:17:02] (03PS2) 10Jbond: (WIP) sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 [14:17:36] (03PS3) 10Hashar: scap config for mediawiki/tools/releases [puppet] - 10https://gerrit.wikimedia.org/r/701543 
(https://phabricator.wikimedia.org/T274255) [14:17:46] (03PS1) 10MSantos: maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 [14:17:50] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [14:21:23] (03PS1) 10David Caro: cinderutils.ensure: don't add mountpoint if we exec [puppet] - 10https://gerrit.wikimedia.org/r/701559 [14:21:41] 10SRE: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10RLazarus) Per @Joe: This is probably okay in production since everything uses the discovery names instead of talking directly to svc.$DC anyway, but we should get it fixed by regenerating the certs prop... [14:21:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Hm, I think this will crash on Test Wikidata in its current form? Either the variable needs some default in IS.php, or we should check iss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [14:24:22] (03CR) 10jerkins-bot: [V: 04-1] (WIP) sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [14:26:15] (03PS6) 10Lucas Werkmeister (WMDE): Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [14:27:49] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.3231 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:28:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps2007.codfw.wmnet with reason: reimaging as buster replica [14:28:00] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 
days, 0:00:00 on maps2007.codfw.wmnet with reason: reimaging as buster replica [14:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:08] (03CR) 10Muehlenhoff: "> For the use case where user's manually run `systemctl mask $unit` i think its correct that puppet comes along and corrects that, in the " [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm) [14:28:36] (03CR) 10Hashar: [C: 04-1] "Pending analyze of https://puppet-compiler.wmflabs.org/compiler1001/823/deploy1002.eqiad.wmnet/index.html :)" [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [14:30:39] (03CR) 10Elukey: [C: 03+2] "Going to try this on the ml-serve cluster, let's see if it works :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [14:30:45] (03CR) 10Jgiannelos: maps: fix osm sync directory path (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [14:34:22] (03CR) 10Jbond: "> I agree! 
But that's not necessarily a shared common understanding across all of SRE, so maybe this will need some explicit announcement " [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm) [14:34:32] what's that appservers alert [14:35:07] it started around 13:00 Z [14:35:27] weird didn't see it [14:35:41] and corresponds to a high latency alert no one noticed :/ [14:36:04] aw man, I was only looking at the 14:03 spike, I missed the earlier increase [14:36:10] let's see [14:37:00] I think it times as usual with a spike in memcached requests, but I wonder why it never recovered [14:37:20] rzl: it's possible it's just a few servers stuck in a bad state, I'd check the cpu usages first [14:37:48] is it appservers eqiad? [14:39:11] joe: https://grafana.wikimedia.org/d/D1DS8IsWk/cdanis-cluster-cpu-skew?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instances=All&from=now-2h&to=now [14:39:16] neat bimodal distribution [14:39:20] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard shows something different though [14:39:21] and cdanis++ for this heatmap [14:39:25] now... 
what IS that [14:40:16] oh, unless that's the normal pattern just from server models/weighting -- checking [14:40:30] now that I say that it's almost certainly what it is, more coffee required [14:40:39] yeah, same pattern a week ago, ignore [14:40:52] (03PS3) 10Jgiannelos: Unify production server and pregeneration images [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701529 [14:41:35] (03CR) 10Dzahn: [C: 03+1] "+1 (with a heavy heart, sigh)" [puppet] - 10https://gerrit.wikimedia.org/r/701537 (owner: 10Muehlenhoff) [14:43:08] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&viewPanel=11 [14:43:09] 🤔 [14:43:40] (03PS2) 10David Caro: cinderutils.ensure: don't add mountpoint if we exec [puppet] - 10https://gerrit.wikimedia.org/r/701559 [14:44:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab2001.wikimedia.org [14:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:07] jelto: ^ IRC bot is leaking the info that it finished :) [14:48:15] mutante: I'm preparing the change for dhcp/role, one sec ;) [14:48:27] rzl: not sure if completely unrelated but https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=39&from=now-2d&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 looks very weird [14:48:40] (03PS3) 10Hnowlan: maps: make maps2007 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) [14:49:10] elukey: wow, it sure does [14:49:40] the timing overlaps but doesn't align fully, I'm not sure if it's related either [14:49:41] timing do not really match completely with what we are seeing, but my eye was caught by that wall [14:49:49] yeah it's certainly eye-catching [14:50:57] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is 
CRITICAL: 0.3077 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:52:32] (03PS1) 10Jelto: DHCP and site: add gitlab2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/701565 (https://phabricator.wikimedia.org/T285456) [14:52:51] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:53:14] (03CR) 10Dzahn: [C: 03+1] DHCP and site: add gitlab2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/701565 (https://phabricator.wikimedia.org/T285456) (owner: 10Jelto) [14:53:20] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29999/console" [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:53:25] I'm nowhere with this but still looking [14:53:29] jelto: lgtm! 
[14:53:29] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:55:55] this is very noticeable --^ [14:56:31] yeah, GET latency is also steadily creeping up [14:58:41] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.3385 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:58:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: do not add our CAs artificially [deployment-charts] - 10https://gerrit.wikimedia.org/r/701511 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [15:01:11] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:01:37] (03Merged) 10jenkins-bot: mediawiki: do not add our CAs artificially [deployment-charts] - 10https://gerrit.wikimedia.org/r/701511 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [15:02:37] related, plausibly: Jun 25 15:00:25 mw1387 php7.2-fpm[29779]: [WARNING] [pool www] child 33464, script '/srv/mediawiki/docroot/wikidata.org/w/index.php' (request: "GET /wiki/Special:EntityData/Q128121.rdf") executing too slow (15.249310 sec), logging [15:02:38] 10Puppet, 10Cloud Services Proposals, 
10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10bd808) [15:03:03] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:03] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:07:08] rzl: wondering if this is normal GET /wiki/Special:EntityData/Q14330.rdf") executing too slow (18.936964 sec) [15:07:22] yeah, just found the same ^ not sure if cause or symptom yet, though [15:07:24] that's the Wikidata special page thing [15:07:30] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10CKoerner_WMF) 05Open→03Resolved a:03CKoerner_WMF I can confirm that I'm now receiving the digest emails. Thanks y'all for the quick response. [15:07:45] 10SRE, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... 
- https://phabricator.wikimedia.org/T285538 [15:08:01] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) [15:08:07] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Add the puppet CA to the MediaWiki deployment - https://phabricator.wikimedia.org/T284417 (10Joe) 05Open→03Resolved [15:08:57] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:13:09] rzl: probably not Wikidata specific, seeing other requests as "too slow" in there as well [15:14:45] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:17:19] 200, 404, 501 moving together in the graphs, not limited to one of them [15:19:03] joe points out some servers are wedged with worker saturation since the 1400Z spike, e.g. 
https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1320 [15:19:12] I'm going to restart php-fpm there and look for others [15:19:52] makes sense yes [15:20:43] !log rzl@mw1320:~$ sudo restart-php7.2-fpm # workers stuck since the ~14:00 request spike [15:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:58] traffic on appservers (network 5 min avg) is NOT going up [15:22:33] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:24:06] that fixed 1320 [15:24:16] but just a single one can cause this? [15:24:28] (03PS1) 10Herron: rsyslog_recieve: logrotate set maxage and rotate empty logs [puppet] - 10https://gerrit.wikimedia.org/r/701576 (https://phabricator.wikimedia.org/T285371) [15:24:39] probably not :) looking for the rest [15:24:44] rzl: 1319 next [15:24:46] it'll be unevenly distributed but not THAT unevenly [15:26:32] should I do the restart there? [15:27:15] if you like -- I'm still just trying to come up with a complete list so we can cumin it out instead of hitting them one-by-one [15:27:24] rzl: how are you finding the targets? 
[15:28:12] haven't yet, that's what I'm working on -- I'd be faster if I were better at promql :( [15:28:17] !log [mw1319:~] $ sudo restart-php7.2-fpm [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:47] ah right from prometheus, perfect [15:29:01] in Icinga, under "warnings" some show the opcache health check [15:29:08] but besides 1320 the rest are codfw [15:29:27] found 1319 by clicking in grafana one after another [15:31:39] maybe as follow up we could add a graph showing the top 10/20 appservers with busy workers [15:32:12] yeah, was thinking the same [15:32:25] oh I think I had that at some point [15:32:28] ummmm [15:34:37] for example I am watching topk(10, irate(mediawiki_http_requests_duration_sum{cluster="$cluster",handler!="-",instance=~".*:3903"}[5m])) [15:35:26] mw1331 detected [15:35:36] eh mw1332 [15:35:48] both are in the top10 [15:36:32] !log [mw1332:~] $ sudo restart-php7.2-fpm [15:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:28] mw1320, mw1321,mw1322,mw1323,mw1325,mw132[6-9],mw133[0-2] [15:37:32] mw1330 [15:37:38] [mw1332:~] $ sudo restart-php7.2-fpm [15:38:22] mutante: if you want to restart the above ones (missing from your list) otherwise I'll do it [15:38:25] !log [mw1330:~] $ sudo restart-php7.2-fpm [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:43] elukey: ok, can do [15:38:54] rzl: ok if we restart the above list? 
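The host list pasted at 15:37:28 mixes plain names with bracket shorthand such as mw132[6-9]. A minimal Python sketch of expanding that shorthand into individual hostnames, assuming each bracket group holds a comma-separated mix of single values and inclusive numeric ranges. The helper names (expand_hosts, expand_pattern, split_top_level) are hypothetical, written only for illustration, and are not part of cumin or any other tool mentioned in the channel:

```python
import re

_BRACKET = re.compile(r"\[([^\]]+)\]")


def split_top_level(text):
    """Split on commas that sit outside any [...] group."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def expand_pattern(pattern):
    """Expand one pattern such as 'mw132[6-9]' into a list of hostnames."""
    match = _BRACKET.search(pattern)
    if match is None:
        return [pattern]
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    hosts = []
    for part in match.group(1).split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            width = len(low)  # preserve zero padding, e.g. 08-10
            values = [str(n).zfill(width) for n in range(int(low), int(high) + 1)]
        else:
            values = [part]
        for value in values:
            # Recurse in case the suffix contains another bracket group.
            hosts.extend(expand_pattern(prefix + value + suffix))
    return hosts


def expand_hosts(text):
    """Expand a comma-separated list of host patterns."""
    hosts = []
    for pattern in split_top_level(text):
        hosts.extend(expand_pattern(pattern))
    return hosts


pasted = "mw1320, mw1321,mw1322,mw1323,mw1325,mw132[6-9],mw133[0-2]"
print(expand_hosts(pasted))
```

An expanded list like this is easier to compare against the hosts already restarted by hand.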
[15:39:08] perfect, thanks for getting there faster than me [15:39:44] I wish, it was a horrible grafana copy/paste :D [15:39:55] we'll restart some so you can come up with a complete list [15:40:01] and then cumin them all :) [15:40:59] [for later] feel free to comment on T187709 ;) [15:41:00] T187709: Cumin feature idea: Prometheus backend - https://phabricator.wikimedia.org/T187709 [15:41:04] !log mw1330, mw1320, mw1321, mw1322 - restarted php-fpm [15:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] 1323 is not pooled? [15:42:45] let's also check if they heal or if they get back into busy state [15:42:56] (03CR) 10Hnowlan: "While we're doing these changes, there is a reference of `osmosis_dir` in osm::planet_sync. Should we remove that (and just bake it into t" [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [15:44:00] FYI if you can perform the check locally somehow, you can do cumin 'check command' 'restart command', it will perform the restart only if the check command succeeds [15:44:22] there we go, tweaked elukey's query and got what I was looking for: https://grafana.wikimedia.org/goto/dcgULGz7k [15:44:36] ugh, I have to get more practice at those :( [15:45:03] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04615 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:45:13] looks like: mw13[27,29,28,25,31,26,23] [15:45:15] for 1323 and 1327 I got some warnings like "not pooled, should be enabled/up/pooled" during the restart [15:45:24] but config-master does not agree about that [15:45:37] (in order by saturation, worst one first) [15:46:36] !log mw1326, mw1327, mw1328, mw1329 ...
restarted php-fpm [15:46:40] and all those were on your list so I guess the only new finding is "and that's all" [15:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:49:22] I think that's it now? [15:50:08] yeah, I want to watch the graphs a while longer but everything's moving in the right direction [15:50:55] also did 1324 [15:51:45] 1331 showed a value of 99.7 and was also restarted. all others are 37 and lower, ACK [15:52:09] (03PS1) 10Urbanecm: Growth: Enable community configuration at all Growth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701579 (https://phabricator.wikimedia.org/T285423) [15:52:38] 1364 popping up now, adding that [15:53:10] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/701576 (https://phabricator.wikimedia.org/T285371) (owner: 10Herron) [15:53:11] but the saturation graph for that looks actually fine [15:53:18] oh, that grafana link went to elukey's original query instead of the one I had open?? rude [15:53:39] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:55] or not even, something in between [15:54:02] mutante: popping up where? 
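The 15:44:00 suggestion (perform the restart only if a local check command succeeds) can be sketched in plain Python. This shows the check-then-act idea only, assuming "succeeds" means exit status 0. The function run_if_check_passes and the placeholder commands are hypothetical; this is not cumin's implementation or its API:

```python
import subprocess
import sys


def run_if_check_passes(check_cmd, action_cmd):
    """Run action_cmd only when check_cmd exits with status 0.

    Returns the action's exit status, or None if the action was skipped.
    """
    check = subprocess.run(check_cmd)
    if check.returncode != 0:
        return None
    return subprocess.run(action_cmd).returncode


# Placeholder commands: small Python one-liners standing in for a real
# local check (for example "are the workers saturated?") and a real
# restart command. Both are assumptions made for this sketch.
failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing_check = [sys.executable, "-c", "import sys; sys.exit(0)"]
fake_restart = [sys.executable, "-c", "print('restarting')"]

skipped = run_if_check_passes(failing_check, fake_restart)
ran = run_if_check_passes(passing_check, fake_restart)
print(skipped, ran)
```

With this shape the check has to exit 0 on hosts that need the restart, so a saturation check would report success when the host is saturated, not when it is healthy.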
[15:54:20] rzl: when sorting by "Value #A" in that link you pasted
[15:54:30] it was the new leader with around 70
[15:54:53] but that is already over now
[15:55:24] ahh okay, yeah looks like a temporary bump and 70 is a much more normal value anyway
[15:55:35] everything is under 50 now
[15:55:54] most of the noise was in the 200-250 range and the worst offenders were up past 300
[15:56:08] *nod* cool
[15:56:28] anyway, looking good
[15:56:30] thanks both <3
[15:56:46] the "too slow" log lines are normal, ftr
[15:57:13] yeah, I think we were getting a higher rate of them during this, but that also might be my imagination
[15:57:21] not really worth doing the analysis to find out
[15:57:35] yes
[15:57:41] to both of that
[15:59:42] nice :)
[15:59:58] it is still a bit weird not knowing what caused this
[16:01:47] if it was an Icinga check on each host, like the opcache check, we could technically use eventhandler to automatically do the restart, but seems a bit scary too
[16:02:18] elukey: so, T285538 was the trigger at least -- when we resolved that and restarted a bunch of maintenance jobs at the same time, that caused the spike in traffic that led to the saturation
[16:02:19] T285538: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/MWWikiversions.php:94 - https://phabricator.wikimedia.org/T285538
[16:02:31] why that didn't clear up on its own though, unclear
[16:08:03] rzl: ahh okok I lost the info about the trigger, thanks :)
[16:11:31] on -tech a user was reporting a time out on https://en.wikipedia.org/wiki/User:Ninjatacoshell/List_of_nodulated_plants_and_their_symbionts but that is 1.6MB of raw wiki text and https://en.wikipedia.org/wiki/Special:LongPages shows that is like 3 times more than the longest article... so ...
shrug
[16:13:59] (PS1) Andrew Bogott: Nova: format new ephemeral volumes with ext4 [puppet] - https://gerrit.wikimedia.org/r/701581
[16:15:01] going for dinner then,laters
[16:27:42] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but please collect +1 from david too." [puppet] - https://gerrit.wikimedia.org/r/701581 (owner: Andrew Bogott)
[16:35:41] SRE, serviceops, Datacenter-Switchover: Various services hardcode api.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T285518 (Legoktm) Invalid→Open p:High→Lowest OK, I think those repos should be cleaned up then to not confuse people.
[16:36:19] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1002 job=burrow partition={2,5} prometheus=ops site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datas
[16:36:19] anos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[16:48:30] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (Legoktm) Note sure if it's in scope for this task, but the lack of Puppe...
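For reference, the check-then-restart idea mentioned during the php-fpm incident above (run a check command, and restart only where the check succeeds) can be sketched locally in Python. This is a minimal sketch and not cumin itself: `restart_if_check_passes` is a hypothetical helper, and both commands are placeholders supplied by the caller.

```python
import subprocess
from typing import Sequence


def restart_if_check_passes(check_cmd: Sequence[str], restart_cmd: Sequence[str]) -> bool:
    """Run check_cmd; run restart_cmd only if the check exits with status 0.

    Returns True if the restart command ran and succeeded, False otherwise.
    """
    check = subprocess.run(check_cmd, capture_output=True)
    if check.returncode != 0:
        # Check failed: leave the service alone.
        return False
    restart = subprocess.run(restart_cmd, capture_output=True)
    return restart.returncode == 0
```

The point of the gate is that hosts which do not show the symptom are never touched, which is the less scary variant of the automatic restart discussed above.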
[16:49:29] (CR) Legoktm: [C: +2] Revert "mysql_legacy.py: Add x2" [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[16:57:20] (Merged) jenkins-bot: Revert "mysql_legacy.py: Add x2" [software/spicerack] - https://gerrit.wikimedia.org/r/701471 (https://phabricator.wikimedia.org/T285519) (owner: Legoktm)
[17:03:57] SRE, observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (Quiddity)
[17:04:02] (PS1) Zabe: [doc] switching from freenode to libera.chat [deployment-charts] - https://gerrit.wikimedia.org/r/701591 (https://phabricator.wikimedia.org/T283273)
[17:05:05] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:12:00] SRE, Infrastructure-Foundations, SRE-tools, Spicerack, Datacenter-Switchover: switchdc: systemctl disable command failed, because units were already gone - https://phabricator.wikimedia.org/T285524 (Legoktm) a:Legoktm >>! In T285524#7176824, @cmooney wrote: > AFAIK if you run "list-units"...
[17:15:49] (PS1) Legoktm: mediawiki: Remove unnecessary and broken disable of systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701593 (https://phabricator.wikimedia.org/T285524)
[17:19:45] (PS1) RLazarus: 00-reduce-ttl: Sleep after updating TTL [cookbooks] - https://gerrit.wikimedia.org/r/701594
[17:29:48] SRE, DC-Ops, Infrastructure-Foundations, netops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (ayounsi) a:ayounsi
[17:30:50] (PS2) Legoktm: 00-reduce-ttl: Sleep after updating TTL [cookbooks] - https://gerrit.wikimedia.org/r/701594 (owner: RLazarus)
[17:31:09] (CR) Legoktm: [C: +2] "LGTM, added how to skip the sleep to the commit message" [cookbooks] - https://gerrit.wikimedia.org/r/701594 (owner: RLazarus)
[17:33:05] SRE, observability: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (CDanis)
[17:34:52] (Merged) jenkins-bot: 00-reduce-ttl: Sleep after updating TTL [cookbooks] - https://gerrit.wikimedia.org/r/701594 (owner: RLazarus)
[17:35:12] (CR) RLazarus: [C: +1] mediawiki: Remove unnecessary and broken disable of systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701593 (https://phabricator.wikimedia.org/T285524) (owner: Legoktm)
[17:36:43] (CR) Urbanecm: [C: +1] "thanks for noticing this" [deployment-charts] - https://gerrit.wikimedia.org/r/701591 (https://phabricator.wikimedia.org/T283273) (owner: Zabe)
[17:37:03] (I'd merge, but I don't know what a merge on deployment-charts does)
[17:37:11] SRE, observability, Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (CDanis)
[17:38:23] (CR) RLazarus: [C: +1] swift: Only run swiftrepl-mw in the active datacenter [puppet] - 
https://gerrit.wikimedia.org/r/701052 (https://phabricator.wikimedia.org/T285373) (owner: Legoktm)
[17:39:22] SRE, SRE-OnFire, observability, Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (CDanis)
[17:39:46] SRE, SRE-OnFire, observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis)
[17:40:41] (CR) Legoktm: [C: +2] [doc] switching from freenode to libera.chat [deployment-charts] - https://gerrit.wikimedia.org/r/701591 (https://phabricator.wikimedia.org/T283273) (owner: Zabe)
[17:41:36] urbanecm: deployment-charts is auto pulled to /srv/deployment-charts on deploy hosts
[17:41:55] legoktm: so on docs only changes like this, I can safely just pull the trigger?
[17:42:08] yeah
[17:42:16] good to know, thanks
[17:42:24] theres no "oh you left something undeployed" like MW, since helm is what controls what's deployed
[17:42:46] if the change was to a service's chart or included files that requires redeploying the service, then you want to deploy it right after merging
[17:42:53] (PS1) CDanis: statograph: Initial commit [software/statograph] - https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569)
[17:43:30] (Merged) jenkins-bot: [doc] switching from freenode to libera.chat [deployment-charts] - https://gerrit.wikimedia.org/r/701591 (https://phabricator.wikimedia.org/T283273) (owner: Zabe)
[17:43:50] i see.
i deployed a service only once, under supervision
[17:44:08] i wish my understanding about how it works was higher -- i guess i'll have to learn that with mw on k8s anyway
[17:47:13] (PS2) CDanis: statograph: Initial commit [software/statograph] - https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569)
[17:59:29] (CR) Legoktm: [C: +2] swift: Only run swiftrepl-mw in the active datacenter [puppet] - https://gerrit.wikimedia.org/r/701052 (https://phabricator.wikimedia.org/T285373) (owner: Legoktm)
[18:00:10] Jbond: P:logoutd: create wrapper script for calling logout.d scripts (210aa08367)
[18:01:23] ok, just looks like adding a file
[18:01:44] jbond: I merged your ^ change on the puppetmaster
[18:02:51] SRE, SRE-OnFire, observability, Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (CDanis)
[18:06:54] Jun 25 18:02:48 ms-fe2005 systemd[1]: swiftrepl-mw.timer: Refusing to start, unit to trigger not loaded.
[18:06:54] Jun 25 18:02:48 ms-fe2005 systemd[1]: Failed to start Periodic execution of swiftrepl-mw.service.
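When a timer refuses to start because its "unit to trigger" is not loaded, as in the two systemd lines above, checking the unit's `LoadState` is a quick way to narrow things down (a masked unit reports `masked`). A minimal Python sketch, assuming a host with systemd; `parse_load_state` and `load_state` are hypothetical helper names:

```python
import subprocess


def parse_load_state(show_output: str) -> str:
    """Extract the LoadState value from `systemctl show` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "LoadState":
            return value.strip()
    raise ValueError("LoadState not found in systemctl output")


def load_state(unit: str) -> str:
    """Ask systemd for a unit's load state, e.g. 'loaded' or 'masked'."""
    result = subprocess.run(
        ["systemctl", "show", "--property=LoadState", unit],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_load_state(result.stdout)
```

On an affected host, `load_state("swiftrepl-mw.service")` would be the call to try; the parsing is kept separate so it can be exercised without systemd.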
[18:07:09] I've never seen this error before
[18:08:12] ahhh, the unit is masked
[18:08:39] !log legoktm@ms-fe2005:~$ sudo systemctl unmask swiftrepl-mw.service
[18:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:46] (CR) Legoktm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: Legoktm)
[18:38:02] SRE, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (Jclark-ctr)
[18:47:37] SRE, InternetArchiveBot, Traffic, Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (Legoktm) p:Medium→Low @Cyberpower678 it looks much better now, so thanks for that, however I sti...
[19:01:44] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (wiki_willy)
[19:08:19] SRE, Infrastructure-Foundations, Mail: Please create "grant@wikipedia.org" email handle to use for annual fundraising email test - https://phabricator.wikimedia.org/T285432 (MNoorWMF) hello @faidon - permissions have been Granted by Grant :), Sam Patton is currently hunting the paper trail down and w...
[19:53:29] (CR) Kosta Harlan: "This change is ready for review." [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701424 (https://phabricator.wikimedia.org/T283546) (owner: Kosta Harlan)
[19:54:21] dduvall, greg-g and SREs: Hello, Growth's newcomer homepage is currently broken, and we'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/699742 to fix it. It's a frontend change.
Given it's Friday, asking for approvals per guidance at https://wikitech.wikimedia.org/wiki/Deployments/Emergencies.
[19:54:24] (also see the private chan)
[19:56:36] urbanecm: ok by me
[19:56:50] thanks. Can I also get confirmation by a SRE please?
[19:56:52] the patch itself is pretty safe -- adding an if statement around some code.
[19:58:13] apergos: ^ ?
[19:58:23] I'm not really an sre any more :-)
[19:58:23] yeah looks fairly innocuous to me
[19:58:27] oh :)
[19:58:31] also I'm in a sucky tz for this
[19:58:36] it's 11 pm where I am
[19:58:39] k sorry!
[19:58:43] you want someone in a us tz
[19:58:54] looks like there's probably not much to do from an SRE standpoint but I'm happy to watch :)
[19:59:04] I'm online for another 2-3 hours probably
[19:59:11] rzl: may i take that as an approval to do the deployment? :)
[19:59:17] urbanecm: yep, fire away
[19:59:20] appreciated
[19:59:26] ah thanks! I had just gone to -sre to start grovelling :-D
[19:59:34] (CR) Urbanecm: [C: +2] "emergency deployment" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701424 (https://phabricator.wikimedia.org/T283546) (owner: Kosta Harlan)
[19:59:57] thanks everyone
[20:00:02] so while I sorta kinda took sre-ish sensibilities with me and might do some of that at platform, I'm not now on the sre team, I just wanted to clarify
[20:00:16] so probably not ok for me to go "yeah friday deploy cool" :-D
[20:00:43] sorry to bug you apergos :) definitely noted.
[20:00:55] no problem!
[20:17:23] kostajh: can you help me test it? It's at mwdebug1001 now
[20:17:41] urbanecm: sure, looking
[20:18:34] urbanecm: yes, it's less broken now :)
[20:18:48] that means we're going in the right direction!
[20:18:51] thanks kostajh
[20:18:55] urbanecm: lgtm, in other words
[20:19:23] yup, switching from topics selected and topics unselected sound to work at ro.wikipedia
[20:19:24] syncing
[20:21:12] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/modules/homepage/suggestededits/ext.growthExperiments.Homepage.SuggestedEdits.js: eaec745e4504527d23ddca32eb7fcd531d5553f9: SuggestedEdits: Only log task impression for EditCardWidget (T283546; emergency deployment) (duration: 01m 00s)
[20:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:18] T283546: [wmf.6-regression] SE module - "No more suggestions " card is not displayed - https://phabricator.wikimedia.org/T283546
[20:21:24] fix should be live kostajh
[20:22:03] thank you urbanecm
[20:22:10] any time :)
[20:22:30] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M... - https://phabricator.wikimedia.org/T285538
[20:23:01] Re https://phabricator.wikimedia.org/T285538#7177616 given ^ (see me merging a task), are we just letting everything run at normal next run rather than running early?
[20:23:48] RhinosF1: yup. If there's something more urgent, we can probably run it manually too.
[20:25:48] urbanecm: doesn't look breaking
[20:25:59] I doubt a few days will matter
[20:26:27] Enwiki is super slow though
[20:28:28] Seemed to be edit conflict handling or something because DT was fine
[20:28:45] !log legoktm@mwmaint1002:~$ sudo systemctl start mediawiki_job_update_special_pages.service (T285583)
[20:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:50] T285583: Special pages on WMF wikis are not updating - https://phabricator.wikimedia.org/T285583
[20:29:43] legoktm: I guess you're running it anyway
[20:30:32] yeah, it's pretty straightforward and I screwed it up
[20:31:05] Don't worry
[20:32:17] (PS1) Ssingh: admin_state: depool eqiad for datacenter switchover (June 2021) [dns] - https://gerrit.wikimedia.org/r/701610 (https://phabricator.wikimedia.org/T281515)
[20:32:25] !log legoktm@mwmaint1002:~$ sudo systemctl reset-failed # to clear icinga alert
[20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:24] greg-g: another revert related to T285577 to be deployed, to restore task selection in the newcomers homepage, via https://gerrit.wikimedia.org/r/701609 . Ok for deployment?
[20:33:25] T285577: Several wikis have 0 articles for all ORES topics - https://phabricator.wikimedia.org/T285577
[20:33:47] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:52] dduvall: rzl: ccing the kind people who approved for me 🙂
[20:34:14] SRE, GrowthExperiments, Growth-Team (Current Sprint), Wikimedia-production-error: All cronjobs using foreachwikiindblist broken in production: Fatal error: Uncaught Exception: MWWikiversions::readDbListFile: unable to read . in /srv/mediawiki/multiversion/M...
- https://phabricator.wikimedia.org/T285538
[20:34:20] (same use-case, my fix made it less broken for users, this should fix the underlying issue)
[20:35:05] a plain revert seems fine to me
[20:38:30] (PS1) Herron: logstash: add logstash200[123] to v7 cluster [puppet] - https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266)
[20:39:22] (CR) Legoktm: [C: +1] "LGTM for Monday" [dns] - https://gerrit.wikimedia.org/r/701610 (https://phabricator.wikimedia.org/T281515) (owner: Ssingh)
[20:42:04] ebernhardson: yeah, godspeed
[20:42:10] greg-g: thanks!
[20:45:34] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/30000/" [puppet] - https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266) (owner: Herron)
[20:46:00] I think you win a prize for 30000
[20:46:52] haha
[20:53:05] urbanecm: sorry to miss the ping but yep sgtm
[20:53:52] Thanks, not going to lead this one though :)
[20:53:59] ebernhardson is
[20:55:06] ah cheers
[20:58:04] (PS1) Ebernhardson: Revert "Add support for ores drafttopic" and "Stop querying ores_articletopics" [extensions/CirrusSearch] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701425 (https://phabricator.wikimedia.org/T285577)
[20:58:50] 14
[20:59:34] (CR) Ebernhardson: [C: +2] Revert "Add support for ores drafttopic" and "Stop querying ores_articletopics" [extensions/CirrusSearch] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701425 (https://phabricator.wikimedia.org/T285577) (owner: Ebernhardson)
[21:10:05] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:11:52] (PS1) Arlolra: Disable legacy media structure on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097)
[21:14:58] (CR) Legoktm: [C: -1] "The
preferred way to do this is add it to InitialiseSettings.php with default => true, testwiki/test2wiki => false." [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:15:06] (CR) RhinosF1: "Why not just add the default in CS.php and then use IS.php like most other variables for per wiki overrides?" [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:15:56] (CR) RhinosF1: [C: -1] Disable legacy media structure on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:16:21] legoktm: clicked same time :)
[21:18:26] (CR) Arlolra: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:20:44] (CR) RhinosF1: [C: -1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:21:59] (Merged) jenkins-bot: Revert "Add support for ores drafttopic" and "Stop querying ores_articletopics" [extensions/CirrusSearch] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701425 (https://phabricator.wikimedia.org/T285577) (owner: Ebernhardson)
[21:27:08] kostajh: I have the revert on mwdebug1002 if you want to test. https://ro.wikipedia.org/wiki/Special:Pagina_acas%C4%83 seems plausible to me now
[21:28:26] mewoph: could you possibly have a look? ebernhardson I’m on my phone
[21:28:32] (CR) Arlolra: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:28:39] sure!
I forget about time zones
[21:29:22] kostajh: checking rowiki now
[21:30:25] ebernhardson: AFAICS, it works now
[21:30:42] (PS2) Arlolra: Disable legacy media structure on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097)
[21:31:03] everything else seems fine to me. will ship it
[21:34:23] kostajh: i can now filter by topics, No more suggestions card shows up at the end of the queue as well
[21:34:35] !log ebernhardson@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/CirrusSearch/includes/Parser/FullTextKeywordRegistry.php: cirrus: Revert "Stop querying ores_articletopic" (1/3) (duration: 00m 58s)
[21:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:56] mewoph: thanks!
[21:35:56] !log ebernhardson@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/CirrusSearch/includes/Wikimedia/WeightedTagsHooks.php: cirrus: Revert "Stop querying ores_articletopic" (2/3) (duration: 00m 58s)
[21:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:17] !log ebernhardson@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/CirrusSearch/: cirrus: Revert "Stop querying ores_articletopic" (3/3) (duration: 01m 01s)
[21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:38] (CR) Legoktm: [C: +1] "Though maybe write the commit message as "Enable modern media structure on test wikis"?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[21:54:38] (CR) Legoktm: [C: +2] mediawiki: Remove unnecessary and broken disable of systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701593 (https://phabricator.wikimedia.org/T285524) (owner: Legoktm)
[21:59:54] (Merged) jenkins-bot: mediawiki: Remove unnecessary and broken disable of systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701593 (https://phabricator.wikimedia.org/T285524) (owner: Legoktm)
[21:59:58] SRE, DBA, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Ack, thanks for all the input. For next week we'll just ignore x2, it'll stay RW in both DCs throughout. @krinkle does that also work as th...
[22:02:51] what are the x2 ?
[22:03:10] x2 is a database cluster
[22:03:10] Platonides: a DB section
[22:03:33] (currently unused, afaik)
[22:03:52] yes, I saw they mention it's unused
[22:03:57] but what's it for?
[22:04:07] https://wikitech.wikimedia.org/wiki/MariaDB#Extension_storage
[22:04:16] "not really wiki" stuff
[22:04:44] Strictly speaking that page only mentions X1, not x2
[22:04:51] or maybe better stated as "things that are big like pages but not pages"
[22:04:58] x2 is different from the other sections in that it's read-write in both eqiad and codfw, with replication in both directions -- think of it as an early experiment in active-active architecture
[22:05:19] that's why the question of how to handle it for the switchover -- it's the only section that's normally read-write even in the passive DC
[22:23:36] (CR) RLazarus: "Super excited for this, but I won't be able to take a look until I'm back in the office on July 6 (during what's otherwise a WMF holiday w" [software/statograph] - https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: CDanis)
[22:29:22] *29
[22:39:17] SRE, observability, Datacenter-Switchover: Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (Legoktm) Resolved→Open This was (likely unintentionally) reverted in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/701484
[22:52:41] SRE, Infrastructure-Foundations, SRE-tools, Spicerack, Datacenter-Switchover: switchdc: systemctl disable command failed, because units were already gone - https://phabricator.wikimedia.org/T285524 (Legoktm) Open→Resolved
[23:02:55] (PS3) Arlolra: Enable Parsoid inspired media structure on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097)
[23:03:21] (CR) Arlolra: "> Patch Set 2: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: Arlolra)
[23:08:55] PROBLEM - MariaDB Replica Lag: s3 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1198.66 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:13:36] (PS1) Cwhite: logstash: transition aqs logs to ECS [puppet] - https://gerrit.wikimedia.org/r/701617 (https://phabricator.wikimedia.org/T234565)
[23:33:19] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (bd808) >>! In T285539#7177827, @Legoktm wrote: > Note sure if it's in sc...
[23:40:27] (PS1) RobH: adding sku 370-AEVP [software] - https://gerrit.wikimedia.org/r/701619
[23:41:57] (Abandoned) RobH: adding sku 370-AEVP [software] - https://gerrit.wikimedia.org/r/701619 (owner: RobH)
[23:54:25] SRE, Services, Toolhub, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808)
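As a recap of the x2 discussion earlier in this log: x2 is meant to stay read-write in both datacenters, so a switchover's read-only step has to skip it. The idea can be sketched in a few lines of Python. This is only an illustration: the section list and the `sections_to_set_readonly` helper are hypothetical and are not the spicerack API.

```python
from typing import Iterable, List

# Illustrative section names only; this is not the production list.
EXAMPLE_SECTIONS = ["s3", "s6", "x1", "x2"]


def sections_to_set_readonly(sections: Iterable[str], exclude: Iterable[str] = ()) -> List[str]:
    """Return the sections to make read-only, skipping excluded ones and preserving order."""
    excluded = set(exclude)
    return [section for section in sections if section not in excluded]
```

Calling `sections_to_set_readonly(EXAMPLE_SECTIONS, exclude=["x2"])` leaves x2 out of the read-only pass while every other listed section is still handled.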