[02:10:02] 10SRE, 10Observability-Metrics, 10observability: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (10lmata) [02:18:15] (03CR) 10Samwilson: [C: 03+1] Enable DisamiguatorNotifications on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303) (owner: 10MusikAnimal) [03:34:39] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [04:58:41] (03PS1) 10Urbanecm: Mentor dashboard: Mentor tools [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722129 (https://phabricator.wikimedia.org/T280307) [05:02:27] 10SRE, 10DNS, 10Traffic, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10Marostegui) p:05Triage→03Medium [05:02:51] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: Confirmation message when changing email subscription is broken - https://phabricator.wikimedia.org/T291134 (10Marostegui) p:05Triage→03Medium [05:03:17] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10Marostegui) p:05Triage→03Medium [05:03:35] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10Marostegui) p:05Triage→03Medium [05:08:28] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10Marostegui) p:05Triage→03Medium [05:08:33] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Marostegui) p:05Triage→03Medium [05:09:09] 10SRE, 10serviceops: Remove libvips-tools from mediawiki appservers - 
https://phabricator.wikimedia.org/T290802 (10Marostegui) p:05Triage→03Medium [05:10:08] 10SRE, 10VPS-project-Codesearch, 10serviceops, 10HTTPS: Codesearch main page redirect uses http instead of https - https://phabricator.wikimedia.org/T290819 (10Marostegui) p:05Triage→03Medium [05:10:10] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Marostegui) p:05Triage→03Medium [05:25:55] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki) [06:00:12] !log Upgrade db2071, db2072, db2094 [06:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:04:32] ^ db2094 upgrade, expected [06:09:53] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:24] (03PS1) 10Muehlenhoff: Remove access for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/722251 [06:30:06] (03CR) 10jerkins-bot: [V: 04-1] Remove access for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/722251 (owner: 10Muehlenhoff) [06:31:14] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10elukey) 05Open→03Resolved a:03elukey Seems solved, thanks Ariel! 
[06:33:24] (03PS2) 10Muehlenhoff: Remove access for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/722251 [06:35:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [06:35:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::packages: remove libvips [puppet] - 10https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) (owner: 10Giuseppe Lavagetto) [06:36:03] (03PS2) 10Giuseppe Lavagetto: mediawiki::packages: remove libvips [puppet] - 10https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) [06:36:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/722251 (owner: 10Muehlenhoff) [06:43:00] (03PS1) 10Elukey: statistics: decom old httpd Directory for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722252 (https://phabricator.wikimedia.org/T285355) [06:43:30] (03CR) 10jerkins-bot: [V: 04-1] statistics: decom old httpd Directory for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722252 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [06:45:02] (03PS2) 10Elukey: statistics: decom old httpd Directory for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722252 (https://phabricator.wikimedia.org/T285355) [06:45:44] (03PS1) 10Muehlenhoff: Remove LDAP access for ebjune [puppet] - 10https://gerrit.wikimedia.org/r/722254 [06:46:01] _joe_: er wait, we need to do https://phabricator.wikimedia.org/T291014 first [06:46:17] <_joe_> legoktm: uhhh [06:46:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31133/console" [puppet] - 10https://gerrit.wikimedia.org/r/722252 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [06:46:27] sorry, I should've -1'd your patch earlier [06:46:29] <_joe_> well, as long as we don't install 
new servers, we're ok [06:46:51] yeah [06:47:29] <_joe_> but I can revert though [06:47:49] it shouldn't take more than a week or two I hope [06:47:58] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki::packages: remove libvips" [puppet] - 10https://gerrit.wikimedia.org/r/721958 [06:48:14] <_joe_> yeah, as usual things are more complicated than anticipated [06:48:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for ebjune [puppet] - 10https://gerrit.wikimedia.org/r/722254 (owner: 10Muehlenhoff) [06:53:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "mediawiki::packages: remove libvips" [puppet] - 10https://gerrit.wikimedia.org/r/721958 (owner: 10Giuseppe Lavagetto) [07:00:43] (03CR) 10Muehlenhoff: [C: 03+2] Install swaks on mail servers [puppet] - 10https://gerrit.wikimedia.org/r/721809 (owner: 10Muehlenhoff) [07:02:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow injecting the wmerrors script [deployment-charts] - 10https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [07:02:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::yaml_defs: inject php7-fatal-error.php in k8s [puppet] - 10https://gerrit.wikimedia.org/r/721342 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [07:05:21] PROBLEM - MariaDB Replica Lag: s7 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1199.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:05:35] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: inject statsd for php fatal errors [puppet] - 10https://gerrit.wikimedia.org/r/722255 [07:06:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes::deployment_server: inject statsd for php fatal errors [puppet] - 10https://gerrit.wikimedia.org/r/722255 (owner: 10Giuseppe Lavagetto) [07:08:00] (03CR) 10ArielGlenn: [C: 03+1] "Fine by me" [puppet] - 10https://gerrit.wikimedia.org/r/722026 
(https://phabricator.wikimedia.org/T290340) (owner: 10Ladsgroup) [07:11:29] (03CR) 10ArielGlenn: [C: 03+1] "Guess I'd better update all my docs too. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/721811 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [07:13:48] jouncebot: nowandnext [07:13:48] No deployments scheduled for the next 3 hour(s) and 16 minute(s) [07:13:48] In 3 hour(s) and 16 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1030) [07:13:55] (03PS3) 10Urbanecm: enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720825 (https://phabricator.wikimedia.org/T290927) [07:14:00] (03CR) 10Urbanecm: [C: 03+2] enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720825 (https://phabricator.wikimedia.org/T290927) (owner: 10Urbanecm) [07:14:43] (03Merged) 10jenkins-bot: enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720825 (https://phabricator.wikimedia.org/T290927) (owner: 10Urbanecm) [07:15:16] * urbanecm hates undeployed code [07:15:59] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:16:23] (03PS1) 10Urbanecm: Revert "Configure event stream for map tile state change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721959 (https://phabricator.wikimedia.org/T289771) [07:16:30] (03PS2) 10Urbanecm: Revert "Configure event stream for map tile state change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721959 (https://phabricator.wikimedia.org/T289771) [07:16:42] (03CR) 10Urbanecm: [C: 03+2] "undeployed code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721959 (https://phabricator.wikimedia.org/T289771) (owner: 10Urbanecm) [07:17:40] (03Merged) 10jenkins-bot: 
Revert "Configure event stream for map tile state change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721959 (https://phabricator.wikimedia.org/T289771) (owner: 10Urbanecm) [07:17:53] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:20:37] !log Revert undeployed config patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/721959); not even pulled to deployment, so assuming it never hit prod (T289771) [07:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:43] T289771: Add kafka support for tile-pregeneration events - https://phabricator.wikimedia.org/T289771 [07:27:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:28:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8c1d665b5e83f6b1dd1cc4a9c367cb6881473bba: enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) (T290927) (duration: 00m 57s) [07:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:05] T290927: Scale: increase share of users on English Wikipedia - https://phabricator.wikimedia.org/T290927 [07:29:02] and let enwiki have a nice morning :-) [07:29:35] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/722258 [07:29:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:31] !log uploaded PHP 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf2 to apt.wikimedia.org (component/php7.2 for buster-wikimedia) T291052 [07:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:36] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 [07:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 T167973', diff saved to https://phabricator.wikimedia.org/P17297 and previous config saved to /var/cache/conftool/dbconfig/20210920-073141-marostegui.json [07:31:44] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31134/console" [puppet] - 10https://gerrit.wikimedia.org/r/722258 (owner: 10Giuseppe Lavagetto) [07:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:46] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [07:31:51] (03PS2) 10Urbanecm: Revert "Add throttle rule for Czech wiki course" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721069 [07:31:55] (03CR) 10Urbanecm: [C: 03+2] Revert "Add throttle rule for Czech wiki course" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721069 (owner: 10Urbanecm) [07:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3316 T167973', diff saved to https://phabricator.wikimedia.org/P17298 and previous config saved to /var/cache/conftool/dbconfig/20210920-073206-marostegui.json [07:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:38] (03Merged) 10jenkins-bot: Revert "Add throttle rule for Czech wiki course" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721069 (owner: 10Urbanecm) [07:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depool db1168 T167973', diff saved to https://phabricator.wikimedia.org/P17299 and previous config saved to /var/cache/conftool/dbconfig/20210920-073256-marostegui.json [07:33:01] <_joe_> jouncebot: next [07:33:01] In 2 hour(s) and 56 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1030) [07:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:34] _joe_: I'm finishing a deployment ftr [07:33:39] (03PS1) 10Marostegui: db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/722259 [07:33:57] <_joe_> urbanecm: yeah I saw, I was wondering how many patches it would be :) [07:34:02] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: af9d6e4e29e5f53ad8cf5aa2c235d54500c433bd: Revert "Add throttle rule for Czech wiki course" (duration: 00m 56s) [07:34:03] just two :) [07:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:10] _joe_: feel free to go ahead. [07:34:28] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/722258 (owner: 10Giuseppe Lavagetto) [07:34:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:59] !log Stop db1168 and db2129 in sync T167973 [07:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:04] (03CR) 10Marostegui: [C: 03+2] db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/722259 (owner: 10Marostegui) [07:40:53] (03PS1) 10Giuseppe Lavagetto: mwdebug: harmonize production values [deployment-charts] - 10https://gerrit.wikimedia.org/r/722262 [07:42:38] (03CR) 10Effie Mouzeli: [C: 03+1] mwdebug: harmonize production values [deployment-charts] - 10https://gerrit.wikimedia.org/r/722262 (owner: 10Giuseppe Lavagetto) [07:43:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:54] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:31] (03CR) 10Ayounsi: rancid: convert crons to systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [07:48:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:41] !log uploaded maps-deduped-tilelist 0.0.3~deb10u1 to buster-wikimedia/main [07:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:03] !log uploaded maps-deduped-tilelist 0.0.3~deb10u1 to buster-wikimedia/main T290982 [07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:07] T290982: Support expired tile deduplication - https://phabricator.wikimedia.org/T290982 [07:53:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix data structure hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/722263 [07:53:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::yaml_defs: fix data structure hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/722263 (owner: 10Giuseppe Lavagetto) [07:54:16] (03PS1) 10Muehlenhoff: Install python3-maps-deduped-tilelist on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/722264 (https://phabricator.wikimedia.org/T290982) [07:58:00] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:19] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [08:02:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:25] (03CR) 10Elukey: [V: 03+1 C: 03+2] statistics: decom old httpd Directory for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722252 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [08:16:14] (03CR) 10Jgiannelos: [C: 03+1] Install python3-maps-deduped-tilelist on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/722264 (https://phabricator.wikimedia.org/T290982) (owner: 10Muehlenhoff) [08:20:27] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:25:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: harmonize production values [deployment-charts] - 10https://gerrit.wikimedia.org/r/722262 (owner: 10Giuseppe Lavagetto) [08:26:04] (03PS1) 10KartikMistry: Update cxserver to 2021-09-16-130208-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/722268 [08:26:35] (03PS1) 10Marostegui: Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/721960 [08:27:41] (03CR) 10Marostegui: [C: 03+2] Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/721960 (owner: 10Marostegui) [08:30:22] (03Merged) 10jenkins-bot: mwdebug: harmonize production values [deployment-charts] - 10https://gerrit.wikimedia.org/r/722262 (owner: 10Giuseppe Lavagetto) [08:32:07] 10SRE, 10observability: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (10Marostegui) p:05Triage→03Medium [08:35:01] !log updating clamav on ticket.wikimedia.org/otrs1001 to 0.103.3 [08:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:47] (03PS1) 10Elukey: statistics: remove leftovers for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722270 
(https://phabricator.wikimedia.org/T285355) [08:38:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31135/console" [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [08:40:30] (03PS2) 10Elukey: statistics: remove leftovers for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) [08:41:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31136/console" [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [08:42:11] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:42:59] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10Volans) [08:43:03] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10Volans) 05Resolved→03Open @ssingh I understand there was some issue with the DNS setup between Netbox automation and manual records. I'll try to shade some light here: - When an... [08:47:49] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity of corre... 
[08:49:10] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split of cac... [08:50:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Refactor kubernetes tokens and secrets [labs/private] - 10https://gerrit.wikimedia.org/r/721850 (owner: 10Elukey) [08:52:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 (owner: 10Effie Mouzeli) [08:52:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [08:52:55] (03CR) 10Giuseppe Lavagetto: fixtures: add fixtures for tcp_services_proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [08:58:01] (03PS3) 10Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 [09:08:56] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2021-12-12 08:02:36 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:10:15] !log installing openssl1.0 updates for stretch with backport for forthcoming Let's encrypt issuance chain update (T283165) [09:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:20] T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 [09:14:32] RECOVERY - MariaDB Replica Lag: s7 on db2098 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:18:44] 10Puppet, 10Infrastructure-Foundations: 
Temporary failures for prometheus_puppet_agent_stats - https://phabricator.wikimedia.org/T290726 (10jbond) thanks for the fix @fgiunchedi can this be resolved now? [09:21:09] (03Abandoned) 10Giuseppe Lavagetto: trafficserver::text: also allow www.mediawiki.org for XWD [puppet] - 10https://gerrit.wikimedia.org/r/708772 (owner: 10Giuseppe Lavagetto) [09:23:04] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:33] (03CR) 10Jbond: "nice catch" [puppet] - 10https://gerrit.wikimedia.org/r/720928 (owner: 10Vgutierrez) [09:33:42] (03CR) 10Marostegui: [C: 03+2] Conftool-sections: farewell s10 [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [09:35:54] (03PS1) 10Marostegui: sections.yaml: Remove s10 from codfw [puppet] - 10https://gerrit.wikimedia.org/r/722273 (https://phabricator.wikimedia.org/T167973) [09:42:21] (03PS1) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 [09:42:37] (03CR) 10Jgiannelos: [C: 04-1] "Blocking until deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [09:42:55] (03CR) 10Marostegui: "eqiad was already done: https://gerrit.wikimedia.org/r/722273" [puppet] - 10https://gerrit.wikimedia.org/r/722273 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [09:47:10] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:47:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove s10 from eqiad T167973', diff saved to https://phabricator.wikimedia.org/P17300 and previous config saved to /var/cache/conftool/dbconfig/20210920-094739-marostegui.json 
[09:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:45] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [09:48:07] marostegui: should we raise the time for the alert? ^^^ [09:48:45] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:14] I don't have any perception of how many times it fires as a false positive, but happy to bump it a bit if fires in normal usage [09:52:08] volans: No, in this case it was good as I forgot the commit :) [09:53:00] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:53:47] ok :) [09:58:43] (03CR) 10Jbond: "See comments inline and as always feel free to ping online if they don't make sense" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [09:59:08] !log restarting apache2 on thorium [09:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:35] (03PS16) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [10:00:37] (03PS14) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [10:00:39] (03PS8) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [10:00:41] (03PS4) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) 
[10:00:43] (03PS1) 10Elukey: helmfile.d: move private dirs to the new format [deployment-charts] - 10https://gerrit.wikimedia.org/r/722276 (https://phabricator.wikimedia.org/T286791) [10:01:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:02:49] (03PS1) 10Giuseppe Lavagetto: service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 [10:03:19] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722279 (https://phabricator.wikimedia.org/T280027) (owner: 10Awight) [10:03:26] (03CR) 10jerkins-bot: [V: 04-1] service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 (owner: 10Giuseppe Lavagetto) [10:03:45] (03PS2) 10Elukey: helmfile.d: move private dirs to the new format [deployment-charts] - 10https://gerrit.wikimedia.org/r/722276 (https://phabricator.wikimedia.org/T286791) [10:03:47] (03PS17) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [10:03:49] (03PS15) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [10:03:51] (03PS9) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [10:03:53] (03PS5) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) [10:04:24] (03PS2) 10Giuseppe 
Lavagetto: service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 [10:18:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We're on the right path, but see my comments to make the manifests properly general." [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:26:14] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:28:22] (03PS1) 10David Caro: wmcs: fix lints [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/722282 [10:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1030). [10:31:44] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Marostegui) p:05High→03Medium Is there anything pending here or can it be closed? 
(It is not High anymore I reckon)
[10:32:02] (CR) jerkins-bot: [V: -1] wmcs: fix lints [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/722282 (owner: David Caro)
[10:36:39] Puppet, Infrastructure-Foundations: apt::package_from component dosn't corretlly support passing packages via a hash - https://phabricator.wikimedia.org/T291370 (jbond) Open→In progress p:Triage→Medium
[10:36:50] !log rolling restart bacula & minio daemons on backup hosts
[10:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:59] ^ moritzm
[10:40:39] ack, thx
[10:41:20] !log roll restarting kartotherian and tilerator on maps1*
[10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:24] (CR) Urbanecm: [C: +2] "will deploy this during B&C, needed to facilitate usability testing" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - https://gerrit.wikimedia.org/r/722129 (https://phabricator.wikimedia.org/T280307) (owner: Urbanecm)
[10:45:09] !log roll restarting kartotherian and tilerator on maps2*
[10:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:09] SRE, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-Fundraising, and 6 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (Nikerabbit) Open→In progress
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1100).
[11:00:05] musikanimal and Urbanecm: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:10] o/ [11:00:11] * urbanecm waves [11:00:18] I'll deploy, as I have my own (complex) patch [11:00:34] musikanimal: hi, around? [11:00:59] hm, puppet’s data.yaml is confusing me [11:01:08] the restricted group is commented as “a subset of the deployment group” [11:01:14] but it looks like musikanimal is in restricted but not deployment? [11:01:17] yeah, that should likely be changed [11:01:20] so I’m not sure if they’re able to self-serve ^^ [11:01:23] they're not [11:01:27] ok [11:01:38] (they used to be a deployer, but were removed IIRC) [11:01:57] in fact, from a cursory check, it looks like most of the restricted people aren’t in deployers [11:02:01] yeah [11:02:09] so I guess just the comment is wrong [11:02:12] deployment can do everything restricted can and a bit more [11:02:15] yeah [11:02:18] it's a subset of _privileges_ [11:02:20] not people [11:02:23] ah, I see [11:02:28] that makes some sense [11:02:37] (restricted = mwmaint, mwlog; deployment = mwmaint, mwlog, deployment host, app servers, ...) 
[11:02:44] (in terms of host access) [11:03:43] (03Merged) 10jenkins-bot: Mentor dashboard: Mentor tools [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722129 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [11:03:54] okay, let's do this [11:04:10] (this will take quite some time, but the only other patch is beta, so it doesn't require syncing) [11:04:50] I'm going to first sync the files in a correct order, and then do a full scap on top of it to take care of the i18n things [11:04:54] (03PS1) 10Lucas Werkmeister (WMDE): Clarify comment of restricted group [puppet] - 10https://gerrit.wikimedia.org/r/722331 [11:04:55] ok [11:05:55] !log roll restarting restbase service in eqiad for openssl updates [11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/includes/MentorDashboard/MentorTools/MentorStatusManager.php: b9031bc572f6e3f4e12e6102c2816467af3580f4: Mentor dashboard: Mentor tools (T280307; 1) (duration: 00m 57s) [11:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [11:07:32] !log urbanecm@deploy1002 sync-file aborted: b9031bc572f6e3f4e12e6102c2816467af3580f4: Mentor dashboard: Mentor tools (T280307; 1) (duration: 00m 00s) [11:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:30] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/includes/MentorDashboard/Modules/MentorTools.php: b9031bc572f6e3f4e12e6102c2816467af3580f4: Mentor dashboard: Mentor tools (T280307; 2) (duration: 00m 55s) [11:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] !log roll restarting restbase service in codfw [11:09:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:13] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/ServiceWiring.php: b9031bc572f6e3f4e12e6102c2816467af3580f4: Mentor dashboard: Mentor tools (T280307; 4) (duration: 00m 56s) [11:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:53] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/includes/: b9031bc572f6e3f4e12e6102c2816467af3580f4: Mentor dashboard: Mentor tools (T280307; 5) (duration: 00m 56s) [11:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:12:36] if all done, I can deploy something of mine [11:13:05] Amir1: I'll need a full scap :/ [11:13:22] oh noooo [11:13:31] Do you have ETA? [11:13:35] (03PS1) 10Urbanecm: Mentor dashboard: Enable beta mode at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722338 (https://phabricator.wikimedia.org/T281534) [11:13:41] Amir1: i didn't start it yet :D [11:13:47] I can quickly squeeze mine [11:13:49] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Enable beta mode at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722338 (https://phabricator.wikimedia.org/T281534) (owner: 10Urbanecm) [11:13:54] it'll be blazing fast [11:13:59] Amir1: if it's something quick, feel free to [11:14:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:14:52] (03PS2) 10Btullis: Add temporary rsync modules to two Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) [11:14:56] (03Merged) 10jenkins-bot: Mentor dashboard: Enable beta mode at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722338 (https://phabricator.wikimedia.org/T281534) (owner: 10Urbanecm) [11:14:58] (03PS1) 10Jbond: apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) [11:15:02] (03PS1) 10Ladsgroup: Disable jQuery Migrate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722348 (https://phabricator.wikimedia.org/T280944) [11:15:18] (03PS2) 10Ladsgroup: Disable jQuery Migrate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722348 (https://phabricator.wikimedia.org/T280944) [11:15:29] Amir1: once logmsgbot !log's, feel free to go ahead [11:15:29] (03CR) 10Ladsgroup: [C: 03+2] Disable jQuery Migrate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722348 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [11:15:42] sure [11:15:44] I'll start the full scap once you're done [11:15:44] Thanks [11:16:12] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b518d8ba03e85afdf98f2e06bf569b4f2b551b1b: Mentor dashboard: Enable beta mode at testwiki (T281534) (duration: 00m 55s) [11:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:16] T281534: Conduct user testing for mentor dashboard V1 - https://phabricator.wikimedia.org/T281534 [11:16:26] (03CR) 10jerkins-bot: [V: 04-1] apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 
(https://phabricator.wikimedia.org/T291370) (owner: 10Jbond) [11:16:54] (03Merged) 10jenkins-bot: Disable jQuery Migrate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722348 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [11:16:59] (03CR) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [11:17:05] (03PS2) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 [11:18:08] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:722348|Disable jQuery Migrate on group1 (T280944)]] (duration: 00m 56s) [11:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:13] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [11:18:31] (03PS3) 10Btullis: Add temporary rsync modules to two Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) [11:18:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:05] Amir1: was that all? 
:) [11:19:17] yup [11:19:23] okay, starting the scap [11:19:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31140/console" [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [11:20:01] !log urbanecm@deploy1002 Started scap: b9031bc: Mentor dashboard: Mentor tools (T280307) [11:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:06] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [11:20:37] I would also like to deploy this during the backport window: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/722289 [11:20:41] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: add password placeholder for manila db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/722351 [11:21:40] nemo-yiannis: unfortunately, that has to wait -- i just started deployment for my own patch, which is time consuming, and might well take the rest of the window. As your patch wasn't scheduled at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1100, I didn't know about it. [11:22:10] no worries, i can send it in the next window [11:22:20] sounds great! [11:22:44] (03CR) 10Jgiannelos: [C: 04-1] "Block until next deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [11:23:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:55] hmm, already at sync-apaches? 
full scaps are quite hard to predict it seems [11:30:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: openstack: add password placeholder for manila db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/722351 (owner: 10Arturo Borrero Gonzalez) [11:31:38] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:31:46] !log urbanecm@deploy1002 Finished scap: b9031bc: Mentor dashboard: Mentor tools (T280307) (duration: 11m 44s) [11:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:54] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [11:31:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] well, quicker than I'd expect it to [11:32:52] nemo-yiannis: feel free to do your patch then :-) [11:33:10] (instructions for deployers are at wikitech.wikimedia.org/wiki/Backport_windows/Deployers -- if anything's unclear, do not hesitate to ask) [11:33:25] (03PS2) 10Jbond: apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) [11:33:34] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:34:08] (03CR) 10jerkins-bot: [V: 04-1] apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) (owner: 10Jbond) [11:36:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:18] (03CR) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [11:40:16] (03PS1) 10Btullis: Enable the kerberos auto-renew service for stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/722352 (https://phabricator.wikimedia.org/T268985) [11:41:41] (03CR) 10Arturo Borrero Gonzalez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:42:00] (03PS49) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [11:42:13] (03CR) 10Btullis: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:42:44] (03PS9) 10Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) [11:46:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:26] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:47:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:22] (03PS2) 10KartikMistry: WIP: Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) [11:48:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:48:42] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [11:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:53] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10jijiki) We'll first roll out on our canaries and 5 parsoid servers, and continue with full roll out tomorrow. [11:50:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:16] (03CR) 10Jgiannelos: [C: 04-1] Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [11:52:06] (03CR) 10Ladsgroup: [C: 03+1] "Hi Daniel, do you have time to take a look? We checked it separately and made sure it works." 
[puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [11:56:50] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: don't show config file diff [puppet] - 10https://gerrit.wikimedia.org/r/722355 (https://phabricator.wikimedia.org/T291257) [11:57:18] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation={list,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:59:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: manila: don't show config file diff [puppet] - 10https://gerrit.wikimedia.org/r/722355 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:59:28] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:59:55] (03PS3) 10Jbond: apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) [12:01:06] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:01:24] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:03:39] (03PS2) 10Cathal Mooney: Replacing SSH pub key for mbinder as he rebuilt his laptop. 
[puppet] - 10https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141) [12:08:25] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [12:12:27] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) First pass of Commons full originals completed after 19 days (eqiad), with 99.94% success. Most misses expected, due... [12:12:35] 10Puppet, 10Infrastructure-Foundations: investigate how rspec parses define paramters - https://phabricator.wikimedia.org/T291374 (10jbond) p:05Triage→03Low a:03jbond [12:16:53] (03CR) 10Effie Mouzeli: [C: 03+2] common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 (owner: 10Effie Mouzeli) [12:19:05] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) @ssastry we have done some benchmarks, but non of those were parsoid urls, it would great if you would provide a couple of par... 
[12:22:07] (03Merged) 10jenkins-bot: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 (owner: 10Effie Mouzeli) [12:24:10] (03CR) 10Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1003/31142/" [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) (owner: 10Jbond) [12:24:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: correct variable expansion in template [puppet] - 10https://gerrit.wikimedia.org/r/722357 (https://phabricator.wikimedia.org/T291257) [12:24:51] (03CR) 10Btullis: Install Alluxio to the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:25:12] (03PS4) 10Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 [12:29:18] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [12:30:36] (03CR) 10Effie Mouzeli: [C: 03+2] fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [12:34:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: manila: correct variable expansion in template [puppet] - 10https://gerrit.wikimedia.org/r/722357 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [12:35:35] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [12:39:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720993 (owner: 10Volans) [12:39:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720994 (owner: 10Volans) [12:40:04] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for 
Wikidough - https://phabricator.wikimedia.org/T289536 (10BBlack) Thanks for the clarity, makes a lot of sense! We **can** make this work in either direction, I think (manual or automatic for this handful of IPs/hostnames which occupy thes... [12:41:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720995 (owner: 10Volans) [12:41:26] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:41:59] (03CR) 10Jbond: [C: 03+1] puppet: reduce verbosity of Cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720996 (owner: 10Volans) [12:42:23] !log Add ct_tag_id_log key to db1144:3314 T277416 [12:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:52] (03PS5) 10Jgiannelos: tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [12:43:54] (03PS5) 10Jgiannelos: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [12:43:56] (03PS5) 10Jgiannelos: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [12:44:22] (03CR) 10Jbond: apt::package_from_component: add update condition for multiple packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721275 (owner: 10Hnowlan) [12:45:14] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [12:45:14] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above 
the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:53:35] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [12:54:10] !log installing gnutls28 updates for stretch with backport for forthcoming Let's encrypt issuance chain update (T283165) [12:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:18] T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 [12:57:38] (03Merged) 10jenkins-bot: tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 (owner: 10Effie Mouzeli) [12:58:51] !log Drop ct_tag_id_log key from db1144:3314 T277416 [12:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:40] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [12:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:00] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:09:50] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:10:01] (03CR) 10Effie Mouzeli: [C: 03+2] fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [13:10:16] (03CR) 10Ottomata: [C: 03+1] Add temporary rsync modules to two Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [13:10:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:10:48] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add temporary rsync modules to two Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [13:11:03] (03CR) 10Ottomata: [C: 03+1] "TY!" [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [13:11:58] (03PS6) 10Jgiannelos: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [13:13:59] (03Merged) 10jenkins-bot: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [13:15:14] (03PS7) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) [13:15:16] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10nskaggs) 05Stalled→03Resolved This also worked without issue over the weekend. Closing for now as resolved. 
Thanks! [13:17:56] (03PS1) 10MSantos: maps: re-enable OSM sync maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/722364 [13:18:19] (03PS27) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [13:19:23] (03CR) 10David Caro: "The failures are all under the sre tree, not to be changed in this branch, so ignore please." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/722282 (owner: 10David Caro) [13:19:38] (03PS8) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) [13:20:09] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. - elukey@cumin1001 [13:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:39:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. 
- elukey@cumin1001 [13:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:04] !log restarting apache on Logstash ELK5 cluster to pick up GNUTLS update T283165 [13:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:09] T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 [13:45:47] 10SRE, 10serviceops: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Reedy) 05Open→03Stalled [13:46:47] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [13:47:03] moritzm: do we need to do anything on the cloud side for the LE chain updates? [13:47:06] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:47:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:51] majavah: for the handful of remaining WMCS servers in production, there wasn't anything relevant (except some services like nslcd, which doesn't really interface with LE) [13:49:00] RECOVERY - Long running screen/tmux on gitlab2001 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [13:49:02] not sure about Toolforge (if there are Stretch instances left) [13:49:19] the grid is unfortunately still Stretch :/ [13:49:34] and we have some Stretch containers [13:49:49] it's fixed in the openssl1.0 and gnutls28 updates [13:50:05] unattended-upgrades should take care of them, but you might need to restart a few services?
[13:50:16] best to create a separate task I suppose [13:50:57] (03Merged) 10jenkins-bot: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [13:51:08] yeah [13:51:12] are you creating or should I? [13:51:27] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] majavah: please go ahead :-) [13:53:45] * majavah does [13:54:37] (03PS1) 10Ottomata: Add comment about 'statistics' packages [puppet] - 10https://gerrit.wikimedia.org/r/722368 (https://phabricator.wikimedia.org/T275786) [13:57:18] moritzm: T291387 [13:57:18] T291387: Ensure Cloud Services platforms will accept new LE issuance chain - https://phabricator.wikimedia.org/T291387 [13:57:50] ack, thx [14:03:33] (03CR) 10Ottomata: [C: 03+2] Add comment about 'statistics' packages [puppet] - 10https://gerrit.wikimedia.org/r/722368 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [14:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: After migrating wikitech to codfw', diff saved to https://phabricator.wikimedia.org/P17302 and previous config saved to /var/cache/conftool/dbconfig/20210920-140333-root.json [14:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] (03CR) 10Jgiannelos: [C: 03+1] maps: re-enable OSM sync maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/722364 (owner: 10MSantos) [14:09:08] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10Volans) >>! In T289536#7365249, @BBlack wrote: > Thanks for the clarity, makes a lot of sense! > > We **can** make this work in either direction, I think (manual or automatic for th... 
[14:09:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10jijiki) [14:11:30] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) Thanks Effie. I think as well as the microbursts / drops you observed at the server-side, on the 1G interfaces, performance is probably impacted by on... [14:14:03] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10CAS-SSO, and 3 others: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (10jbond) >>! In T286905#7342139, @MoritzMuehlenhoff wrote: > Adding this functionality goes a little beyond the scope of the logout.d scripts I think. Right no... 
[14:18:34] (03PS1) 10Jelto: modules::gitlab add missing fields from ansible gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) [14:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: After migrating wikitech to codfw', diff saved to https://phabricator.wikimedia.org/P17303 and previous config saved to /var/cache/conftool/dbconfig/20210920-141836-root.json [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] (03PS4) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [14:21:06] (03PS5) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [14:21:35] (03CR) 10jerkins-bot: [V: 04-1] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [14:21:40] (03PS6) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [14:22:10] (03CR) 10jerkins-bot: [V: 04-1] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [14:24:37] (03PS7) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [14:28:41] 10SRE, 10Infrastructure-Foundations, 10Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10MoritzMuehlenhoff) For production: * OpenSSL in Buster and Bullseye is not affected (only ship OpenSSL 1.1) * OpenSSL updates for openssl 1.0.2 in St... 
[14:29:27] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: fix routed_via [deployment-charts] - 10https://gerrit.wikimedia.org/r/722372 [14:30:43] (03PS1) 10Volans: prospector: disable pylint consider-using-f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/722373 [14:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: After migrating wikitech to codfw', diff saved to https://phabricator.wikimedia.org/P17304 and previous config saved to /var/cache/conftool/dbconfig/20210920-143340-root.json [14:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:03] (03CR) 10Jbond: [C: 03+2] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [14:36:04] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: fix routed_via [deployment-charts] - 10https://gerrit.wikimedia.org/r/722372 (owner: 10Effie Mouzeli) [14:36:25] (03CR) 10Jbond: [C: 03+1] prospector: disable pylint consider-using-f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/722373 (owner: 10Volans) [14:36:48] (03CR) 10Volans: [C: 03+2] prospector: disable pylint consider-using-f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/722373 (owner: 10Volans) [14:37:58] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] create role to deploy staging instance for quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [14:38:23] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] create role to deploy staging instance for quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [14:38:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (10cmooney) Ok so looking at the results from the 
two hosts in question I'm not sure we can make any definitive conclusions. Following the switchover back to eqia... [14:39:14] (03Merged) 10jenkins-bot: prospector: disable pylint consider-using-f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/722373 (owner: 10Volans) [14:40:14] (03Merged) 10jenkins-bot: tegola-vector-tiles: fix routed_via [deployment-charts] - 10https://gerrit.wikimedia.org/r/722372 (owner: 10Effie Mouzeli) [14:41:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: fix lints [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/722282 (owner: 10David Caro) [14:42:11] (03PS1) 10Marostegui: admin: Acces resquest for Mew Ophaswongse [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) [14:42:34] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] (03CR) 10Marostegui: [C: 04-2] "Waiting for approvals" [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) (owner: 10Marostegui) [14:42:46] (03CR) 10jerkins-bot: [V: 04-1] admin: Acces resquest for Mew Ophaswongse [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) (owner: 10Marostegui) [14:43:19] (03CR) 10Cathal Mooney: [C: 03+2] Replacing SSH pub key for mbinder as he rebuilt his laptop. 
[puppet] - 10https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141) (owner: 10Cathal Mooney) [14:45:31] (03CR) 10Herron: "A few high level questions:" [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [14:46:28] (03PS1) 10Volans: pylint: fix newly reported issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/722376 [14:47:48] (03PS2) 10Marostegui: admin: Access resquest for Mew Ophaswongse [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) [14:48:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10Marostegui) [14:48:15] (03PS3) 10Marostegui: admin: Access resquest for Mew Ophaswongse [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) [14:48:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: After migrating wikitech to codfw', diff saved to https://phabricator.wikimedia.org/P17305 and previous config saved to /var/cache/conftool/dbconfig/20210920-144844-root.json [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] (03PS1) 10Volans: prospector: disable pylint consider-using-f-string [software/cumin] - 10https://gerrit.wikimedia.org/r/722378 [14:55:23] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb={PATCH,POST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:55:49] (03PS4) 10Marostegui: admin: Access request for Mew Ophaswongse [puppet] - 10https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) [14:56:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:57:27] (03PS4) 10Jbond: apt::package_from_component: use apt-get update exec from init class [puppet] - 10https://gerrit.wikimedia.org/r/722345 (https://phabricator.wikimedia.org/T291370) [15:00:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:00:50] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:01:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:04:37] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:08:13] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:08:37] (03PS14) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [15:09:04] urbanecm: sorry, I scheduled for the wrong time slot! 
rescheduled for 18:00 UTC [15:09:10] (03CR) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [15:09:37] and no, I do not have deployer rights [15:14:30] (03PS1) 10Majavah: hieradata: fix profile::puppet::client_bucket::file_age on cloud [puppet] - 10https://gerrit.wikimedia.org/r/722380 [15:16:05] (03CR) 10Razzi: [C: 03+1] statistics: remove leftovers for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [15:18:40] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Elitre) Can this be closed then? [15:18:50] (03CR) 10Jbond: [C: 03+2] "thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/722380 (owner: 10Majavah) [15:21:01] (03PS1) 10Jforrester: Disable logging [extensions/GuidedTour] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722395 (https://phabricator.wikimedia.org/T288416) [15:22:21] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:22:49] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:25:58] (03PS1) 10Jbond: P:puppet::client_bucket: temporarily absent the nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/722382 [15:26:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppet::client_bucket: temporarily absent the nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/722382 (owner: 10Jbond) [15:27:09] 
RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:28:19] PROBLEM - Check for large files in client bucket on ms-be1061 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:14] PROBLEM - Check for large files in client bucket on cloudelastic1001 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:19] PROBLEM - Check for large files in client bucket on elastic2053 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:24] PROBLEM - Check for large files in client bucket on elastic1043 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:25] PROBLEM - Check for large files in client bucket on elastic2035 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:35] PROBLEM - Check for large files in client bucket on ms-be2065 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:29:43] PROBLEM - Check for large files in client bucket on rdb1006 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:30:09] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:30:43] PROBLEM - Check for large files in client bucket on ms-be1043 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:30:49] PROBLEM - Check for large files in client bucket on restbase1017 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:31:34] PROBLEM - Check for large files in client bucket on ms-be1057 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:31:47] PROBLEM - Check for large files in client bucket on restbase2016 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:31:53] PROBLEM - Check for large files in client bucket on ms-be1032 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:31:54] PROBLEM - Check for large files in client bucket on ganeti1022 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:32:04] PROBLEM - Check for large files in client bucket on kubernetes2005 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:32:05] PROBLEM - Check for large files in client bucket on sessionstore1001 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:32:11] PROBLEM - Check for large files in client bucket on 
ores2002 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:32:31] jbond: related? [15:32:38] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Legoktm) 05Open→03Resolved I'm not aware of anything else, and probably if there is it's worth tracking it individually. [15:32:59] PROBLEM - Check for large files in client bucket on cloudelastic1004 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:07] PROBLEM - Check for large files in client bucket on elastic1040 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:15] PROBLEM - Check for large files in client bucket on ms-be2062 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:20] (03CR) 10Muehlenhoff: [C: 03+2] Install python3-maps-deduped-tilelist on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/722264 (https://phabricator.wikimedia.org/T290982) (owner: 10Muehlenhoff) [15:33:25] PROBLEM - Check for large files in client bucket on kubernetes1017 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:25] PROBLEM - Check for large files in client bucket on aqs1006 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:27] PROBLEM - Check for large files in client bucket on ms-be1047 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined 
https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:27] PROBLEM - Check for large files in client bucket on restbase1020 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:28] PROBLEM - Check for large files in client bucket on ms-be2047 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:34] PROBLEM - Check for large files in client bucket on db1118 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:41] PROBLEM - Check for large files in client bucket on elastic2044 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:43] PROBLEM - Check for large files in client bucket on kubernetes1009 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:33:54] PROBLEM - Check for large files in client bucket on ores2008 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:34:23] PROBLEM - Check for large files in client bucket on ores1004 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:34:29] jbond: --^ [15:34:41] PROBLEM - Check for large files in client bucket on kafka-main1002 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:34:49] PROBLEM - Check for large files in client bucket on ms-be2061 is CRITICAL: NRPE: 
Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:35:04] PROBLEM - Check for large files in client bucket on sessionstore2003 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:35:27] PROBLEM - Check for large files in client bucket on pybal-test2002 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:35:46] * volans running puppet on alert1001 to see if it removed a bunch of them [15:35:49] PROBLEM - Check for large files in client bucket on cloudelastic1003 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:01] PROBLEM - Check for large files in client bucket on elastic2039 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:01] PROBLEM - Check for large files in client bucket on elastic2030 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:04] PROBLEM - Check for large files in client bucket on ganeti2009 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:04] PROBLEM - Check for large files in client bucket on kafka-main1003 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:04] PROBLEM - Check for large files in client bucket on ganeti1013 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined 
https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:04] PROBLEM - Check for large files in client bucket on elastic2050 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:05] PROBLEM - Check for large files in client bucket on ms-be1028 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:05] PROBLEM - Check for large files in client bucket on ms-be1058 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:07] I think it gets removed by puppet on the target host and then icinga still try to run the check via NRPE [15:36:07] PROBLEM - Check for large files in client bucket on ms-fe1008 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:10] and hence the failure [15:36:15] moving to -sre [15:36:19] PROBLEM - Check for large files in client bucket on sessionstore2001 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:19] PROBLEM - Check for large files in client bucket on restbase-dev1004 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:39] PROBLEM - Check for large files in client bucket on ganeti2011 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:36:49] PROBLEM - Check for large files in client bucket on ms-be2052 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined 
https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:37:01] PROBLEM - Check for large files in client bucket on elastic1056 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:03] PROBLEM - Check for large files in client bucket on ms-be1045 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:05] PROBLEM - Check for large files in client bucket on db1139 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:05] PROBLEM - Check for large files in client bucket on conf1005 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:11] PROBLEM - Check for large files in client bucket on ms-be2060 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:14] PROBLEM - Check for large files in client bucket on elastic1052 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:15] PROBLEM - Check for large files in client bucket on elastic2043 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:29] PROBLEM - Check for large files in client bucket on ms-be2049 is CRITICAL: NRPE: Command check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:38:31] PROBLEM - Check for large files in client bucket on ms-be2054 is CRITICAL: NRPE: Command 
check_check_client_bucket_large_file not defined https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [15:50:05] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) >>! In T289536#7365588, @Volans wrote: > That said Netbox is not and will probably never be (from upstream comments) a DNS source of truth. We already have cases not well co... [15:54:37] (03CR) 10Hnowlan: [C: 03+2] maps: re-enable OSM sync maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/722364 (owner: 10MSantos) [15:56:20] (03PS9) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [16:02:58] (03PS1) 10Jbond: P:puppet::client_bucket: update to use a proper nrpe check script [puppet] - 10https://gerrit.wikimedia.org/r/722406 [16:05:20] (03CR) 10Jbond: [C: 03+2] P:puppet::client_bucket: update to use a proper nrpe check script [puppet] - 10https://gerrit.wikimedia.org/r/722406 (owner: 10Jbond) [16:15:54] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:32] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/nda group - https://phabricator.wikimedia.org/T291391 (10Marostegui) p:05Triage→03Medium [16:16:38] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/nda group - https://phabricator.wikimedia.org/T291391 (10Marostegui) a:03Marostegui [16:23:15] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:52] SRE, SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (cmooney) @MBinder_WMF I've pushed the change now, please test and see if your access is now working and let me know. thanks!
[16:36:41] Puppet, Infrastructure-Foundations: investigate how rspec parses define paramters - https://phabricator.wikimedia.org/T291374 (jbond) I have created a demo project and upstream issue to dig into this https://github.com/puppetlabs/rspec-puppet/issues/13
[16:41:06] !log upgrading php on wtp[1025-1029] to 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf2 - T291052
[16:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:11] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052
[16:42:28] SRE, Infrastructure-Foundations, netops, procurement: Move AMS-IX port to 802.1q tagged and get "private vlan" added - https://phabricator.wikimedia.org/T291407 (cmooney)
[16:45:55] (PS1) Jbond: P:puppet::client_bucket: exit with 0 when ok not 1 [puppet] - https://gerrit.wikimedia.org/r/722409
[16:46:25] (CR) Jbond: [V: +2 C: +2] P:puppet::client_bucket: exit with 0 when ok not 1 [puppet] - https://gerrit.wikimedia.org/r/722409 (owner: Jbond)
[16:57:14] (CR) Dzahn: rancid: convert crons to systemd timers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[16:57:49] (PS2) Dzahn: rancid: convert crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673)
[16:58:57] (PS1) Legoktm: Revert "traffic: Depool codfw from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/722397 (https://phabricator.wikimedia.org/T287539)
[16:59:12] (CR) RLazarus: [C: +1] Revert "traffic: Depool codfw from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/722397 (https://phabricator.wikimedia.org/T287539) (owner: Legoktm)
[16:59:46] (PS2) Legoktm: Revert "traffic: Depool codfw from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/722397 (https://phabricator.wikimedia.org/T287539)
[17:00:05] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1700).
[17:00:18] thanks rzl :)
[17:00:24] thank you!
[17:00:49] (CR) Legoktm: [C: +2] Revert "traffic: Depool codfw from user traffic for switchover" [dns] - https://gerrit.wikimedia.org/r/722397 (https://phabricator.wikimedia.org/T287539) (owner: Legoktm)
[17:02:18] > OK - authdns-update successful on all nodes!
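For context on the two client_bucket fixes logged earlier ("update to use a proper nrpe check script", then "exit with 0 when ok not 1"): Nagios-style plugins, including those run through NRPE, report state through their exit status, where 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. A check that exits 1 when everything is fine is therefore shown as a warning. The log does not include the actual script, so the following is only a rough sketch of that convention. The function names, the size threshold and the directory layout are assumptions made for illustration, not the contents of change 722409.

```python
import os

# Standard Nagios plugin exit statuses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def find_large_files(root, max_bytes):
    """Return sorted paths under root whose size exceeds max_bytes."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    large.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(large)


def check(root, max_bytes):
    """Return (exit_status, message) following the Nagios convention."""
    if not os.path.isdir(root):
        return UNKNOWN, f"UNKNOWN: {root} is not a directory"
    large = find_large_files(root, max_bytes)
    if large:
        return CRITICAL, f"CRITICAL: {len(large)} large file(s), e.g. {large[0]}"
    # The healthy path must map to status 0, not 1.
    return OK, "OK: no large files"


def main(argv):
    """argv: [prog, bucket_dir, optional max_bytes]; returns the exit status."""
    if len(argv) < 2:
        print("UNKNOWN: usage: check <bucket_dir> [max_bytes]")
        return UNKNOWN
    # Hypothetical default threshold of 100 MiB.
    max_bytes = int(argv[2]) if len(argv) > 2 else 100 * 1024 * 1024
    status, message = check(argv[1], max_bytes)
    print(message)
    return status
```

A real plugin would finish with `sys.exit(main(sys.argv))` so that the NRPE daemon sees the status. Returning the status from `main` rather than exiting inside it keeps the logic easy to test.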
[17:02:43] !log repooled codfw (traffic/caches) 1 week after DC switchover
[17:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:22] (CR) Dzahn: "normally I would have reviewed these by checking if it matches LDAP but when I try ldapsearch on mwmaint I am being asked for a password n" [puppet] - https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) (owner: Marostegui)
[17:15:49] (CR) Urbanecm: admin: Access request for Mew Ophaswongse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) (owner: Marostegui)
[17:25:49] (CR) Dzahn: [C: +1] "The request stated on the ticket was "run maintenance scripts in mwmaint servers and to train link recommendation model on stats machines"" [puppet] - https://gerrit.wikimedia.org/r/722375 (https://phabricator.wikimedia.org/T290200) (owner: Marostegui)
[17:35:10] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1003/31145/ms-fe2005.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[17:42:30] codfw looks mostly back now
[17:42:57] I'm a bit surprised none of the traffic increase/decrease alerts fired
[17:43:59] SRE, Datacenter-Switchover, Patch-For-Review, User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (Legoktm) Open→Resolved a:Legoktm All done!
[17:54:25] 10SRE, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Legoktm) [17:54:37] 10SRE, 10Traffic, 10MW-1.35-notes (1.35.0-wmf.40; 2020-07-07), 10Patch-For-Review, and 2 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Legoktm) [17:59:24] (03CR) 10Bstorm: "Arguably, since we made scaling out fairly simple, you should just scale up to 2 replicas if you want to avoid brief downtimes." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T1800) [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:06:44] (03CR) 10MSantos: [C: 03+1] Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [18:11:57] PROBLEM - Long running screen/tmux on gitlab2001 is CRITICAL: CRIT: Long running tmux process. (user: root PID: 17829, 2344160s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:14:24] (03PS1) 10Daniel Kinzler: WIP: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) [18:18:30] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) p:05Triage→03Medium [18:19:40] (03CR) 10Ottomata: [C: 03+1] Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722289 (owner: 10Jgiannelos) [18:24:05] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10aaron) [18:43:33] (03Abandoned) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/720865 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [18:52:35] (03PS5) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) [18:59:20] (03CR) 10Michael DiPietro: create role to deploy staging instance for quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [19:01:36] (03PS6) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) [19:08:11] (03CR) 10Michael DiPietro: create role to deploy staging instance for quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [19:11:15] (03CR) 10Michael DiPietro: "This patch should 
allow us to deploy a staging quarry instance from code instead of using manual steps. Hopefully assisting in maintaining" [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [19:14:11] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [19:16:10] (03PS1) 10Andrew Bogott: New roles for NFS servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/722418 (https://phabricator.wikimedia.org/T291406) [19:17:10] 10SRE, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 (10MoritzMuehlenhoff) >>! In T283714#7304593, @fgiunchedi wrote: > I was able to get a working python3-eventlet package by integrating [[... [19:33:14] hi Amir1 ! I think we need to revert the latest mediawiki-config patch [19:33:26] (03PS1) 10Bstorm: cloudnfs: set up a PoC Openstack instance-based nfs server [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) [19:33:43] it seems to have broken CentralNotice [19:34:37] See https://phabricator.wikimedia.org/T291410 [19:34:56] (03CR) 10jerkins-bot: [V: 04-1] cloudnfs: set up a PoC Openstack instance-based nfs server [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [19:35:32] So we need to rewrite those widgets not to use the jqueryui multiselect [19:35:57] but we need CentralNotice admin to keep using while we do that [19:37:03] greg-g: we need to revert a config change asap [19:37:21] (03PS2) 10Bstorm: cloudnfs: set up a PoC Openstack instance-based nfs server [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) [19:38:19] greg-g: basically, turning off JQmigrate broke 
CentralNotice: https://phabricator.wikimedia.org/T291410 [19:38:20] whom do we ping to check that that's ok? [19:38:41] marostegui effie ^ ? [19:39:36] AndyRussG: Someone with deployer privileges should be able to help, legoktm maybe? [19:39:49] Hi [19:40:17] legoktm: Would you be able to help here? It is pretty late in the EU evening [19:40:25] yep [19:40:31] legoktm: marostegui: hi thanks! ok thanks we also want to check that it's OK to revert [19:40:31] legoktm: thanks a lot <3 [19:41:04] so revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/722348 ? [19:41:10] I assume so, and would be a temporary reversion, while we update/fix the code [19:41:32] it's also disabled on other wikis for a while now, nlwiki, all wikibooks + all wikisources, commons [19:41:41] legoktm: yep that's it! [19:41:50] it's only important on meta, which is group1 [19:42:09] that's where the CentralNotice interface that broke lives [19:42:17] so just reverting that one is fine [19:42:23] (03PS1) 10Legoktm: Revert "Disable jQuery Migrate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722404 (https://phabricator.wikimedia.org/T291410) [19:42:43] would you like me to sync that out for you? [19:43:08] yes, that would be great legoktm! [19:43:18] thank you [19:43:20] (03CR) 10AndyRussG: [C: 03+1] "Thanks!!!!!!! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722404 (https://phabricator.wikimedia.org/T291410) (owner: 10Legoktm) [19:43:21] * legoktm waits for CI [19:43:31] legoktm: that'd be fantastic!, thanks!! 
[19:43:43] (03CR) 10Legoktm: [C: 03+2] Revert "Disable jQuery Migrate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722404 (https://phabricator.wikimedia.org/T291410) (owner: 10Legoktm) [19:44:29] (03Merged) 10jenkins-bot: Revert "Disable jQuery Migrate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722404 (https://phabricator.wikimedia.org/T291410) (owner: 10Legoktm) [19:45:58] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Revert "Disable jQuery Migrate on group1" (T291410) (duration: 00m 56s) [19:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:04] T291410: Central notice projects and language choices not loading - https://phabricator.wikimedia.org/T291410 [19:46:15] ejegg, AndyRussG: try now? [19:47:12] thanks legoktm, one sec [19:47:17] legoktm: works for me on mwdebug [19:47:31] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=WikiCari2021 [19:47:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:32] mwdebug1001.equiad [19:48:37] eqiad [19:49:06] not yet without the mwdebug extension enabled tho, I imagine we have to wait a few minutes for RL cache rollover [19:49:20] it should be live everywhere, but yeah [19:49:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:15] legoktm: thanks so so much, yeah! I'll keep trying and will ping in a bit to confirm [19:52:50] legoktm: works great, thanks so so much once again!!! [19:53:09] awesome [20:00:04] chrisalbon and accraze: #bothumor I ♥ Unicode. All rise for Services – Graphoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T2000). [20:03:07] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 708 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:09:08] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 8 probes of 708 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:22:53] (03CR) 10BryanDavis: [C: 03+2] toolhub: chart type is not valid in apiVersion v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/719475 (owner: 10JMeybohm) [20:26:51] (03Merged) 10jenkins-bot: toolhub: chart type is not valid in apiVersion v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/719475 (owner: 10JMeybohm) [20:31:07] (03PS1) 10BryanDavis: toolhub: bump container version to 2021-09-20-194840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/722424 [20:33:29] ejegg: Hi, sorry for not responding sooner, I'm currently in a stuck train ("fun"). If you're using jq.ui and that's broken, feel free to patch core's jq.ui, it's already forked [20:35:27] Amir1: ah hey no worries, thanks and also apologies for the revert [20:36:13] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version to 2021-09-20-194840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/722424 (owner: 10BryanDavis) [20:36:15] I guess we haven't narrowed down exactly what the right fix will be, but the widget at issue is actually checked into the CentralNotice codebase and upstream has been inactive for... 
years [20:37:00] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CentralNotice/+/refs/heads/master/resources/vendor/jquery.ui.multiselect/ [20:37:11] Oh that seems as fun as my stuck train. Possibly even more fun. [20:37:12] so I imagine a "proper" fix would be to replace it [20:38:04] I can put meta to the last wikis getting the removal. So don't worry [20:38:29] Amir1: heheh maybe more fun since it drags on much longer? Here is the task, if you're interested https://phabricator.wikimedia.org/T291431 [20:38:42] Amir1: oh if that's a way to prevent things from being blocked, that sounds great, yeah! [20:38:57] what would be your expected timeline for really removing it on Meta then in that case? [20:39:27] A couple of weeks [20:39:52] Amir1: oki sounds like at least a better temporary fix than the revert we just did then :) [20:40:03] (03Merged) 10jenkins-bot: toolhub: bump container version to 2021-09-20-194840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/722424 (owner: 10BryanDavis) [20:41:07] Yeah, I will reapply the patch soon with meta excluded [20:41:18] Amir1: great thanks! [20:52:51] PROBLEM - Long running screen/tmux on gitlab1001 is CRITICAL: CRIT: Long running tmux process. (user: brennen PID: 14155, 1740317s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:00:04] Reedy and sbassett: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T2100). 
[21:18:26] (03CR) 10Andrew Bogott: "> archive project vol" [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [21:19:37] (03Abandoned) 10Andrew Bogott: New roles for NFS servers on VMs [puppet] - 10https://gerrit.wikimedia.org/r/722418 (https://phabricator.wikimedia.org/T291406) (owner: 10Andrew Bogott) [21:21:48] (03CR) 10Andrew Bogott: "This seems fine but I'm curious about the fork -- is there really nothing we can share with existing NFS puppet code? I imagine that the d" [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [21:25:41] (03CR) 10Bstorm: cloudnfs: set up a PoC Openstack instance-based nfs server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [21:27:00] (03CR) 10Andrew Bogott: [C: 03+1] cloudnfs: set up a PoC Openstack instance-based nfs server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [21:28:34] (03CR) 10Dzahn: [V: 03+1 C: 03+2] swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:30:23] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [21:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:17] (03PS1) 10Bstorm: labslvm: fix the branch around ephemeral vols [puppet] - 10https://gerrit.wikimedia.org/r/722431 (https://phabricator.wikimedia.org/T277078) [21:38:53] (03CR) 10Bstorm: "This fixes an annoying dependency in the LVM module. It was a mistake I didn't see since I had not rebuilt my test VM enough times. 
I test" [puppet] - 10https://gerrit.wikimedia.org/r/722431 (https://phabricator.wikimedia.org/T277078) (owner: 10Bstorm) [21:41:37] !log ms-fe1005 - systemctl start swift_dispersion_stats.service (gerrit:719285) [21:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:57] (03CR) 10Dzahn: "This was only on ms-fe1005. confirmed with cumin the others don't have the cron and service starts fine on ms-fe1005" [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:42:59] (03CR) 10Bstorm: [C: 03+2] labslvm: fix the branch around ephemeral vols [puppet] - 10https://gerrit.wikimedia.org/r/722431 (https://phabricator.wikimedia.org/T277078) (owner: 10Bstorm) [21:45:42] (03PS1) 10Dzahn: swift: remove absented cron for dispersion stats [puppet] - 10https://gerrit.wikimedia.org/r/722433 (https://phabricator.wikimedia.org/T273673) [21:46:04] (03CR) 10Bstorm: [C: 03+2] cloudnfs: set up a PoC Openstack instance-based nfs server [puppet] - 10https://gerrit.wikimedia.org/r/722420 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [21:46:48] (03CR) 10Dzahn: [C: 03+2] swift: remove absented cron for dispersion stats [puppet] - 10https://gerrit.wikimedia.org/r/722433 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:51:31] (03PS1) 10Dzahn: swift: convert dispersion-stats-lowlatency to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/722435 (https://phabricator.wikimedia.org/T273673) [21:52:09] (03PS1) 10BryanDavis: toolhub: pin mcrouter image to 0.41.0-4-20210718 [deployment-charts] - 10https://gerrit.wikimedia.org/r/722436 [21:53:57] (03PS1) 10Dzahn: swift: remove absented cron for dispersion-stats-lowlatency [puppet] - 10https://gerrit.wikimedia.org/r/722437 (https://phabricator.wikimedia.org/T273673) [21:54:40] (03CR) 10Dzahn: [C: 03+2] swift: convert dispersion-stats-lowlatency to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/722435 
(https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:54:58] (03CR) 10Dzahn: [C: 03+2] "This is on: ms-fe2005.codfw.wmnet,ms-fe1005.eqiad.wmnet only" [puppet] - 10https://gerrit.wikimedia.org/r/722435 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:55:23] (03PS2) 10Dzahn: swift: convert dispersion-stats-lowlatency to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/722435 (https://phabricator.wikimedia.org/T273673) [22:03:06] (03PS1) 10Dzahn: swift: absent cron for dispersion-stats-lowlatency [puppet] - 10https://gerrit.wikimedia.org/r/722438 (https://phabricator.wikimedia.org/T273673) [22:03:41] (03CR) 10BryanDavis: [C: 03+2] toolhub: pin mcrouter image to 0.41.0-4-20210718 [deployment-charts] - 10https://gerrit.wikimedia.org/r/722436 (owner: 10BryanDavis) [22:05:30] !log changing user email for MIskander (WMF)@collabwiki [22:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:58] (03Merged) 10jenkins-bot: toolhub: pin mcrouter image to 0.41.0-4-20210718 [deployment-charts] - 10https://gerrit.wikimedia.org/r/722436 (owner: 10BryanDavis) [22:08:07] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:08:13] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 1.224e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:08:38] (03PS2) 10Dzahn: swift: absent cron for dispersion-stats-lowlatency [puppet] - 10https://gerrit.wikimedia.org/r/722438 (https://phabricator.wikimedia.org/T273673) [22:10:27] !log wdqs1004 - service wdqs-updater restart [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:19] (03CR) 10Dzahn: [C: 03+2] swift: absent cron for dispersion-stats-lowlatency [puppet] - 
10https://gerrit.wikimedia.org/r/722438 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:14:23] !log wdqs1004 - depool [22:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:51] bd808 could you file a bug for the mcrouter:latest issue? [22:20:57] legoktm: I totally will... once I prove that is the problem [22:21:17] :) ty [22:21:51] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [22:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:53] Looks like I didn't read the error messages well enough. It is the "docker-registry.wikimedia.org/prometheus-mcrouter-exporter:latest" image that is being rejected, not "docker-registry.discovery.wmnet/mcrouter:latest". [22:28:00] (03PS1) 10BryanDavis: Revert "toolhub: pin mcrouter image to 0.41.0-4-20210718" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722448 [22:37:58] hmm, that was supposed to be fixed in I90cc6babab81790711e43d71850d2191dd7de25c [22:38:38] yeah... `docker image inspect` seems to show "User": "65534" when I pull locally [22:38:50] I wonder if the chart is not re-pulling? [22:39:05] * bd808 digs around some more [22:39:08] it's pullIfNotPresent, so :latest probably has an older copy [22:39:42] what if you try :0.0.1-2-20210919 ? 
[22:42:56] 1001 has docker-registry.discovery.wmnet/prometheus-mcrouter-exporter latest 1957b8c51410 10 months ago 80.5MB [22:43:02] 1002 has docker-registry.wikimedia.org/prometheus-mcrouter-exporter latest c1cf680608ca 3 weeks ago 79.7MB [22:43:30] (03PS2) 10Dzahn: swift: remove absented cron for dispersion-stats-lowlatency [puppet] - 10https://gerrit.wikimedia.org/r/722437 (https://phabricator.wikimedia.org/T273673) [22:46:49] (03CR) 10Dzahn: [C: 03+2] swift: remove absented cron for dispersion-stats-lowlatency [puppet] - 10https://gerrit.wikimedia.org/r/722437 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:47:21] (03CR) 10BryanDavis: [C: 03+2] Revert "toolhub: pin mcrouter image to 0.41.0-4-20210718" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722448 (owner: 10BryanDavis) [22:49:50] legoktm: I'm going to try to "fix" by changing the pull policy for my charts to Always. There are a bunch of :latest tags on sidecar bits and its going to be random what happens with them otherwise. [22:50:46] why not pin them to specific versions? [22:51:12] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Dzahn) [22:51:26] (03Merged) 10jenkins-bot: Revert "toolhub: pin mcrouter image to 0.41.0-4-20210718" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722448 (owner: 10BryanDavis) [22:51:36] legoktm: well... so I don't have to change N tags every time something like prometheus-mcrouter-exporter gets a bump mostly. [22:52:12] There certainly can be more than one opinion on this stuff though. [22:52:41] I want "it just works and SREs are not sad that things are out of date" with the least work for everyone possible. [22:53:35] "pull_policy: Always" will not actually always pull. It will always ask the registry for the hash of the image and then pull only if the hash is not cached on the node. 
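The difference between the two pull policies discussed above (the "pullIfNotPresent" behaviour that left a stale `:latest` on one node, and "pull_policy: Always" as described at 22:53:35) can be illustrated with a small simulation. This is only a sketch of the decision logic as it is described in the conversation, not kubelet code: the function `image_to_run`, the dict-based node cache and registry, and the `sha256:old` / `sha256:new` digests are all hypothetical names made up for the example.

```python
def image_to_run(policy, tag, node_cache, registry):
    """Return (digest_that_runs, pulled) for one container start.

    node_cache and registry are plain dicts mapping an image tag to a
    digest; both are hypothetical stand-ins for the node's local image
    store and the remote registry.
    """
    if policy == "IfNotPresent":
        # The tag is already on the node: use it without asking the registry,
        # even if the registry now points the tag at a newer image.
        if tag in node_cache:
            return node_cache[tag], False
        digest = registry[tag]
        node_cache[tag] = digest
        return digest, True

    if policy == "Always":
        # Always ask the registry which digest the tag currently resolves to,
        # but only transfer the image if that digest is not cached locally.
        digest = registry[tag]
        pulled = digest not in node_cache.values()
        node_cache[tag] = digest
        return digest, pulled

    raise ValueError(f"unknown pull policy: {policy}")


# Hypothetical state: the registry has a newer :latest than the node cache.
registry = {"prometheus-mcrouter-exporter:latest": "sha256:new"}
stale_node = {"prometheus-mcrouter-exporter:latest": "sha256:old"}

if_not_present = image_to_run(
    "IfNotPresent", "prometheus-mcrouter-exporter:latest", dict(stale_node), registry
)

node = dict(stale_node)
always_first = image_to_run("Always", "prometheus-mcrouter-exporter:latest", node, registry)
always_second = image_to_run("Always", "prometheus-mcrouter-exporter:latest", node, registry)
```

Under these assumptions the IfNotPresent node keeps running the old image, while the Always node picks up the new digest on the first start and does not transfer anything on the second one, since the digest is then already cached. Pinning a specific tag, as suggested at 22:50:46, sidesteps the question because the tag is not expected to move.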
[22:54:09] for me at least I think having coordinated transitions / upgrades and not having things randomly break when they get upgraded underneath me is more valuable [22:55:10] I think in most other places we have exporters/sidecars pinned? [22:55:16] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10MBinder_WMF) I am able to SSH into ssh mbinder@phab1001.eqiad.wmnet. I haven't tried running a bulk command, but I suspect as long as I can get in it will work as usual. I think I'... [22:57:49] The mwdebug helmfile.d is pinning some, but not all of them (not this one specifically), but maybe that is an oversight in that implementation rather than a designed intent. [22:58:48] legoktm: so if I do pin the sidecar containers, how will I or someone else know to update the pins? Like is there a feed of "new image published for X" that I should be watching somewhere? [22:59:17] * bd808 knows this is all still evolving [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210920T2300). [23:00:05] musikanimal and tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:17] I would think that the person upgrading the image would grep and file tasks (or one shared task) for other charts using it to bump their version [23:00:21] I'm here! [23:00:37] I imagine soon we'll get to test this process when we start moving containers from buster to bullseye [23:00:39] o/ [23:01:11] I can do the deploys [23:01:18] unless something is going on? [23:01:32] not that I'm aware of [23:02:26] tgr: I'm just doing k8s stuff. 
Not anywhere near mw bits [23:02:35] cool, thx [23:04:54] (03PS2) 10Gergő Tisza: Enable DisamiguatorNotifications on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303) (owner: 10MusikAnimal) [23:05:44] (03PS1) 10Gergő Tisza: AddLink: Skip over headings in phrase matching [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) [23:06:29] (03CR) 10Gergő Tisza: [C: 03+2] Enable DisamiguatorNotifications on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303) (owner: 10MusikAnimal) [23:07:17] (03Merged) 10jenkins-bot: Enable DisamiguatorNotifications on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303) (owner: 10MusikAnimal) [23:11:28] tgr: tested working :) thank you! [23:11:48] (03PS1) 10BryanDavis: toolhub: set "docker.pull_policy: Always" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722472 (https://phabricator.wikimedia.org/T291442) [23:13:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:32] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: Skip over headings in phrase matching [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) (owner: 10Gergő Tisza) [23:20:23] (03CR) 10BryanDavis: [C: 03+2] toolhub: set "docker.pull_policy: Always" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722472 (https://phabricator.wikimedia.org/T291442) (owner: 10BryanDavis) [23:22:08] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/nda group - https://phabricator.wikimedia.org/T291391 (10Dzahn) I can confirm it's normal that WMDE employees should be in BOTH groups, wmde and nda. And since Georgina is already in the puppet admin module in the "ldap_only" section a... [23:22:10] !log LDAP - added georginaburnett-wmde to NDA group (T291391, T273780) [23:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:19] T273780: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 [23:22:20] T291391: Request to add Georgina Burnett to the ldap/nda group - https://phabricator.wikimedia.org/T291391 [23:22:46] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/nda group - https://phabricator.wikimedia.org/T291391 (10Dzahn) 05Open→03Resolved [23:26:24] (03Merged) 10jenkins-bot: toolhub: set "docker.pull_policy: Always" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722472 (https://phabricator.wikimedia.org/T291442) (owner: 10BryanDavis) [23:29:00] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[23:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:37] (03CR) 10jerkins-bot: [V: 04-1] AddLink: Skip over headings in phrase matching [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) (owner: 10Gergő Tisza) [23:39:35] (03CR) 10Gergő Tisza: [C: 03+2] "Selenium false positive:" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) (owner: 10Gergő Tisza) [23:39:46] (03CR) 10Gergő Tisza: [C: 03+2] AddLink: Skip over headings in phrase matching [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) (owner: 10Gergő Tisza) [23:41:18] (03PS1) 10Krinkle: ci: Apply profile::wmcs::lvm as needed for new integration instances [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) [23:41:53] (03PS2) 10Krinkle: ci: Apply profile::wmcs::lvm as needed for new integration instances [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) [23:42:52] (03PS6) 10Huji: Temporarily disable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) [23:48:53] (03CR) 10Krinkle: "@Hashar This is currently cherry-picked on the integration puppetmaster to ensure qemu-agent-* provision without errors on bullseye." [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [23:49:08] (03CR) 10Krinkle: "@Hashar This is currently cherry-picked on the integration puppetmaster to ensure qemu-agent-* provision without errors on bullseye." 
[puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [23:50:33] (03PS1) 10Bstorm: cloudnfs: switch packages to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/722477 (https://phabricator.wikimedia.org/T291406) [23:53:52] (03CR) 10Bstorm: [C: 03+2] cloudnfs: switch packages to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/722477 (https://phabricator.wikimedia.org/T291406) (owner: 10Bstorm) [23:59:54] (03Merged) 10jenkins-bot: AddLink: Skip over headings in phrase matching [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/722449 (https://phabricator.wikimedia.org/T291361) (owner: 10Gergő Tisza)