[00:05:38] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:24] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:28] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.151 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:10:34] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 8.283e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:11:16] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:18:35] <wikibugs>	 SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (Platonides) It's still sending the announcement-only mail, but the cron is working now. :-) :-) The code was already almost working. I simply changed a debugging...
[03:13:26] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 7.738e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:17:04] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 7.734e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[04:48:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:50:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui) @Bstorm clouddb1019 and clouddb1020 are on this rack. @razzi dbstore1007 is on this rack
[04:52:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:53:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[04:54:48] <wikibugs>	 SRE, Analytics, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) @Bstorm I assume we are ok with having a glitch on clouddb1017 and 1018? @razzi dbstore1005 is on this rack.
[04:55:11] <wikibugs>	 SRE, Analytics, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:57:23] <wikibugs>	 SRE, Analytics, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[05:00:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[06:43:08] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service Muehlenhoff Used for Buster migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:15] <wikibugs>	 (PS1) QChris: Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175
[06:44:17] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175 (owner: QChris)
[06:47:44] <_joe_>	 !log restart wdqs-updater on wdqs1007
[06:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:45] <moritzm>	 !log start rasdaemon on sretest1001, didn't start after last reboot from a week ago
[06:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:28] <wikibugs>	 (PS2) QChris: Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175
[06:49:47] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175 (owner: QChris)
[06:50:16] <wikibugs>	 (PS1) QChris: Import done. Revoke import grants [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703176
[06:50:20] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Import done. Revoke import grants [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703176 (owner: QChris)
[06:50:32] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:35] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[07:00:04] <jouncebot>	 Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210705T0700)
[07:03:54] <_joe_>	 uhm I think I need to completely restart blazegraph on wdqs1007
[07:04:27] <_joe_>	 !log restarting blazegraph, then restarting the updater again
[07:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:02] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 9.114e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:08:04] <_joe_>	 ok that's good news, it started processing again
[07:17:52] <wikibugs>	 SRE, docker-pkg, serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Joe) Open→Resolved I don't think the changelog-creating part should really be part of this task. What we've done now should be enough for the original intended scope of the task....
[07:29:30] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff) Looking at Ganeti VMs, they broadly fall under three/four categories:  SPOF, will need a maintenance window decl...
[07:35:08] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[07:35:49] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[07:36:21] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[07:38:02] <wikibugs>	 SRE, DBA, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Marostegui) p:Triage→Medium
[07:55:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on releases2002.codfw.wmnet with reason: bump CPU count
[07:55:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases2002.codfw.wmnet with reason: bump CPU count
[07:55:35] <wikibugs>	 SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (ops-monitoring-bot) Icinga downtime set by jmm@cumin2002 for 0:30:00 1 host(s) and their services with reason: bump CPU count ` releases2002.codfw.wmnet `
[07:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on releases1002.eqiad.wmnet with reason: bump CPU count
[08:03:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases1002.eqiad.wmnet with reason: bump CPU count
[08:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:41] <wikibugs>	 SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (ops-monitoring-bot) Icinga downtime set by jmm@cumin2002 for 0:30:00 1 host(s) and their services with reason: bump CPU count ` releases1002.eqiad.wmnet `
[08:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:48] <wikibugs>	 SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @dancy: Ack, I've bumped the CPU count to 8. If there's still performance bottlenecks going forward, please reopen the task  ` jmm@cumin2002:~$...
[08:15:21] <moritzm>	 !log rolling out debmonitor-client 0.3.0
[08:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:16] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:45:56] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:10:22] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481)
[09:10:50] <wikibugs>	 (CR) Kosta Harlan: [C: -2] "Do not merge until we've gotten approval in T284481." [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[09:11:54] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:24:30] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff) Looking at Ganeti VMs, they fall under three/four categories:  SPOF, will need a maintenance window declared: * otrs1001 * an-tool1008 * an-tool10...
[09:30:38] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff) Looking at Ganeti VMs, they fall under three/four categories:  SPOF, will need a maintenance window declared: * an-tool1005 * an-to...
[09:35:32] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[09:36:39] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[09:40:10] <wikibugs>	 SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[09:41:44] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[09:45:51] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[09:55:31] <wikibugs>	 (CR) Jbond: [C: +1] "> Patch Set 9:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[10:00:04] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/702883 (owner: Volans)
[10:04:03] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm optional comment" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/702884 (owner: Volans)
[10:05:27] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/702885 (owner: Volans)
[10:10:41] <wikibugs>	 (PS1) Majavah: Read running tools from grid-webservices tool [software/tools-manifest] - https://gerrit.wikimedia.org/r/703189 (https://phabricator.wikimedia.org/T284564)
[10:10:49] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM," [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[10:11:07] <wikibugs>	 (PS2) Jbond: Use IcingaHosts instead of Icinga (generic) [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[10:11:08] <wikibugs>	 (CR) Majavah: [C: -2] "until `webservice` changes are deployed" [software/tools-manifest] - https://gerrit.wikimedia.org/r/703189 (https://phabricator.wikimedia.org/T284564) (owner: Majavah)
[10:12:12] <wikibugs>	 SRE, User-MoritzMuehlenhoff, User-jbond: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (jbond)
[10:12:35] <wikibugs>	 SRE, User-MoritzMuehlenhoff, User-jbond: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (jbond) I think this would potentially be a good candidate to port to the concat module
[10:13:09] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[10:20:23] <wikibugs>	 (CR) H.krishna123: "Thanks Jaime for review. Good point, I will look into running pylint locally as well. Most of the errors seem to be on the SQL query, so I" [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: H.krishna123)
[10:22:50] <wikibugs>	 (CR) H.krishna123: "> Patch Set 2:" (3 comments) [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: H.krishna123)
[10:24:09] <wikibugs>	 SRE, MW-on-K8s, serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (Joe) >>! In T261277#7161333, @Legoktm wrote: > @joe is everything in this ticket now covered by Shellbox?  No, this ticket is about adding an Ingress controller i...
[10:27:00] <jbond>	 !log disable puppet fleet wide to perform puppetdb change
[10:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:28] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetdb::app: Use seperate user for the read databse [puppet] - https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: Jbond)
[10:29:31] <marostegui>	 !log Optimize ruwiki.logging on s6 eqiad with replication T286102
[10:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:39] <stashbot>	 T286102: Please optimize logging table in ruwiki - https://phabricator.wikimedia.org/T286102
[10:37:44] <jbond>	 !log enable puppet fleet wide post puppetdb change
[10:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:54] <moritzm>	 !log upgrading PHP on miscweb*
[10:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:38] <moritzm>	 !log installing tiff security updates on stretch
[11:07:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:27] <moritzm>	 !log installing openexr security updates on stretch
[11:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:48] <wikibugs>	 (PS1) Marostegui: db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703204
[11:45:49] <wikibugs>	 (CR) Marostegui: [C: +2] db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703204 (owner: Marostegui)
[12:14:32] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:15:16] <wikibugs>	 (PS1) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[12:48:08] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1125 to m1 [puppet] - https://gerrit.wikimedia.org/r/703207 (https://phabricator.wikimedia.org/T286042)
[12:48:51] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1125 to m1 [puppet] - https://gerrit.wikimedia.org/r/703207 (https://phabricator.wikimedia.org/T286042) (owner: Marostegui)
[12:50:21] <marostegui>	 !log Stop MySQL on db1117:3321 to clone db1125 T286042
[12:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:31] <stashbot>	 T286042: Move db1124 and db1125 to misc services temporarily - https://phabricator.wikimedia.org/T286042
[12:55:20] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[12:55:33] <marostegui>	 ^ me
[12:56:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[12:56:25] <marostegui>	 ^ me
[12:59:07] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe)
[13:01:24] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Joe)
[13:02:34] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Joe)
[13:20:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[13:20:56] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[13:49:01] <wikibugs>	 (PS1) Marostegui: db1141: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/703210
[13:49:58] <wikibugs>	 (CR) Marostegui: [C: +2] db1141: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/703210 (owner: Marostegui)
[13:53:05] <wikibugs>	 SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) >>! In T280497#7194633, @wkandek wrote: > Is this the dashboard? https://grafana.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=70&orgId=1&from=1625227688488&to=1625246654342...
[13:54:17] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) a:Joe **Architecture:** Contour works decoupling the management layer (contour itself) from the proxying one (using envoy): the first is deployed as a deployment in kubernetes, and needs t...
[14:01:48] <moritzm>	 !log uploaded nginx 1.13.9-1+wmf3 for stretch-wikimedia
[14:01:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:33] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) **Docker images needed**: We get to pick two directions: * If we use the helm chart, we would need a recent version of envoy (easy) and one image for contour. * If we want to use the operatior...
[14:15:31] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) **Deployment** There are two options, similar to istio: * use contour-operator, which also allows to control the runtime status of the cluster (for instance allowing zero-downtime envoy upgrad...
[14:17:25] <wikibugs>	 SRE, Infrastructure-Foundations: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (MoritzMuehlenhoff)
[14:18:56] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:25] <wikibugs>	 (PS3) Muehlenhoff: Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456)
[14:22:27] <wikibugs>	 (PS1) Muehlenhoff: Add separate role for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/703213 (https://phabricator.wikimedia.org/T286206)
[14:22:44] <wikibugs>	 (PS2) Muehlenhoff: Add separate role for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/703213 (https://phabricator.wikimedia.org/T286206)
[14:22:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[14:38:46] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 1 of VMs requested for MX - https://phabricator.wikimedia.org/T286208 (MoritzMuehlenhoff) p:Triage→Medium
[14:45:30] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:46] <wikibugs>	 (PS2) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[15:09:12] <wikibugs>	 SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (wkandek)
[15:24:46] <_joe_>	 !log leaving wdqs1007 depooled so that the updater can recover faster, now at 16.5 hours of lag
[15:24:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:34] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:12] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Legoktm) a:Legoktm
[17:23:04] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (bd808) @aborrero may be able to provide some information from his past work to setup ingress-nginx for Toolforge.
[17:29:29] <wikibugs>	 (PS1) Legoktm: nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212)
[17:37:02] <wikibugs>	 (PS2) Legoktm: nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212)
[17:37:10] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212) (owner: Legoktm)
[17:40:42] <legoktm>	 !log published fixed docker-registry.discovery.wmnet/nodejs10-devel:0.0.4 image (T286212)
[17:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:52] <stashbot>	 T286212: docker-registry.wikimedia.org/nodejs10-devel container after 0.0.3 does not include `npm` - https://phabricator.wikimedia.org/T286212
[17:55:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (Legoktm)
[17:56:22] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (Legoktm) Open→Resolved
[18:00:04] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: lists-next: bad name in “welcome” email - https://phabricator.wikimedia.org/T278433 (Legoktm) Open→Resolved a:Legoktm This was fixed when we upgraded Postorius.
[18:02:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists-next: no clickable link in “confirm” email - https://phabricator.wikimedia.org/T278432 (Legoktm) Open→Resolved a:Legoktm I fixed this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/683555/
[18:04:30] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Legoktm)
[18:06:08] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: In Mailman3, users cannot change their display name from the web - https://phabricator.wikimedia.org/T283128 (Legoktm)
[18:20:04] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:20:58] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:22] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:25] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:19] <wikibugs>	 (PS1) Legoktm: mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218)
[22:34:55] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:14] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:29:49] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (Legoktm) I couldn't figure out how to spoof bounces either, so I subscribed `doesnotexist@wikimedia.org` and lowered the bounce threshold on test4, and manipulated the dates in...
[23:39:54] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (Legoktm) >>! In T285361#7198288, @Legoktm wrote: > But per https://polymorphic.lists.wmcloud.org/postorius/lists/test4.polymorphic.lists.wmcloud.org/members/member/ it doesn't l...
[23:40:48] <wikibugs>	 (CR) Legoktm: [C: +1] "Appears to mostly work in VPS. If what I wrote on T285361#7198289 sounds good to you then I think we're ready to enable this." [puppet] - https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) (owner: Ladsgroup)
[23:47:59] <wikibugs>	 (PS2) Legoktm: mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218)