docs/updates.md

   1 # Updates
   2
   3 Please contact jan.mandel@ucdenver.edu with any questions.
   4
   5 Our goal is to get the maximum performance the hardware and the cooling allow. When Alderaan CPUs get too hot, they are slowed down adaptively to avoid heat damage. If the datacenter overheats, all alderaan nodes shut down automatically.
   6
   7
   8 ### 2024/10/21
   9
  10 * 11am Partial downtime. Several nodes will have their thermal profile lowered to reduce overheating and cycling of CPU speed under load. The nodes cycle the CPU governor settings profile automatically to mitigate excessive CPU temperatures, which may result in inconsistent performance.
  11
  12 * 1pm Maintenance completed. The thermal design profiles of several nodes were decreased to 180W, or less in a few cases. One failed drive in the math-alderaan-s02 disk array was replaced. Operations normal.
  13
  14 ### 2024/10/16
  15
  16 * Please access the cluster only through the `math-alderaan` login node. The legacy login node `clas-compute` may give ssh errors or not respond at all.
  17 * Several nodes are draining or in drain state to prepare for maintenance Monday 2024/10/21.
  18
  19 ### 2024/09/24
  20
  21 * 7pm Several nodes inaccessible or down after an earlier network outage.
  22 * 9pm All alderaan and score nodes normal.
  23
  24 ### 2024/09/11
  25
  26 * Matlab working now. Reason: the license server has changed.
  27
  28 ### 2024/09/10
  29
  30 * Matlab license not working.
  31
  32 ### 2024/07/25
  33
  34 * Nodes math-alderaan-c[02,13,24] had issues that made at least some jobs allocated to them fail, sometimes with no files created.
  35   Reason: time synchronization. Fixed.
  36
  37 ### 2024/07/12
  38
  39 * **Storage migration in progress - please do not add or modify a large volume of files in /home and /storage** The /home and /storage directories are on a slower legacy file server, and commands (like conda) and jobs using files there get stuck at times. For this reason, I have been migrating /home and /storage to high-performance Alderaan disk arrays over the last few months. It takes up to a day to synchronize /home and /storage with their new location even if nothing changed, plus a few hours for every hundred GB in new or modified files. The final synchronization and the switch to the new location will be done on quiet data during a scheduled downtime (to be announced soon). **A large volume of new or modified files in /home and /storage will make the downtime longer.** Deleting or moving files away is OK, and smaller changes are OK too. You are welcome to keep running jobs; I'd just like to encourage you to write new files to your directory in /data001/projects. The easiest may be to run your jobs from there.
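
To get a feel for how new or modified files lengthen the synchronization, here is a minimal sketch of the arithmetic. It assumes that "up to a day" means a 24-hour baseline and that "a few hours" means about 3 hours per 100 GB. Both numbers and the function name are illustrative assumptions, not measurements from the migration.

```python
def estimated_sync_hours(changed_gb, base_hours=24.0, hours_per_100_gb=3.0):
    """Rough upper estimate of the synchronization time in hours.

    base_hours: assumed time to synchronize when nothing changed ("up to a day").
    hours_per_100_gb: assumed extra time per 100 GB of new or modified files
    ("a few hours"); 3.0 is a placeholder, not a measured rate.
    """
    return base_hours + hours_per_100_gb * changed_gb / 100.0


if __name__ == "__main__":
    for gb in (0, 200, 1000):
        print(f"{gb:5d} GB changed -> about {estimated_sync_hours(gb):.0f} hours")
```

Under these assumptions, one TB of changed files would more than double the synchronization time, which is why writing new files to /data001/projects helps keep the downtime short.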
  40
  41 ### 2024/07/01
  42
  43 * The downtime ended, all should be normal. A continuation will be needed and will be announced in due course.
  44
  45 ### 2024/06/28
  46
  47 PLANNED DOWNTIME FRIDAY 6/28 starting at 6pm
  48
  49 * The /home directories are being migrated to higher-performance storage to fix issues such as processes in indefinite D (disk wait) state, which has been causing Slurm jobs and conda to get stuck in many cases.
  50     This last phase of the migration requires a quiet system to transfer the last changes and switch the storage. Therefore, **all logins will be blocked and all Slurm jobs cancelled when the downtime starts.**
  51
  52 * Submission of new slurm jobs will be turned off from Thursday 6/27/2024 6 pm
  53
  54 * To help keep the downtime shorter, please avoid creating or modifying large files or a large number of files in your home directory. Use your directory in /data001/projects instead.
  55
  56 ### 2024/06/27
  57
  58 * Matlab license fixed
  59 * **All partitions will be marked down at 6pm in preparation for the scheduled downtime tomorrow**
  60
  61 ### 2024/06/26
  62
  63 * MATLAB license not working again
  64
  65 ### 2024/06/24
  66
  67 * MATLAB not working because it cannot contact license server. Investigating.
  68   - Update: fixed
  69 * SLURM job submission may not work. Existing jobs should not be affected. Investigating.
  70   - Update: the issue produces warnings but job submission now appears normal.
  71
  72 ### 2024/04/26
  73
  74 * 10am-12pm Planned maintenance time for network testing and configuration. From about 9am, logins will be disabled, partitions stopped, and running jobs suspended or cancelled. Some nodes may be rebooted. Suspended jobs may be able to continue after the maintenance ends, but it is not guaranteed.
  75
  76 * 3pm Maintenance completed. Please let me know if you see any errors.
  77
  78 ### 2024/04/12
  79
  80 * 2:30pm Files in /home and /storage mounts dropped on randomly changing compute nodes. Investigating. All partitions are stopped, no new jobs will start. Existing jobs are allowed to continue, though they may have difficulty doing so.
  81 * 5pm All nodes were rebooted and the issue has been resolved. All partition queues started.
  82
  83 ### 2024/04/11
  84
  85 * 4pm Slurm is down. Investigating.
  86 * 7pm Fixed
  87 * Users' directories in /data001/projects were moved to /data002/projects because of issues with /data001 and replaced by soft links to the new locations. Therefore, the original location shows the same files and it can still be used. No action required at this point.
  88
  89 ### 2024/04/10
  90
  91 * **Maintenance is planned from 11am**
  92 Several nodes are draining and partition math-alderaan-short is not accepting jobs. Nodes which become available will be rebooted.
  93
  94 ### 2024/03/26
  95
  96 * **Maintenance is planned for Tuesday 3/26 from 10am.**
  97 The math-alderaan partition will keep running, but only on a subset of nodes, which will be
  98 shrinking as idle nodes become reserved for the upcoming maintenance. I will try to maintain at least one idle node for new jobs.
  99 The math-alderaan-short and math-alderaan-gpu-short partitions will not accept new jobs starting Monday 3/25 10am so that any jobs running on them can finish within the 1 day partition limits.
 100 The math-alderaan-gpu partition is not accepting any new jobs until maintenance is complete.
 101 No jobs will be cancelled.
 102
 103 * Node math-alderaan-c30 has a bad memory slot and runs with memory reduced to 480GB. It will not be allocated to jobs and currently is reserved for system service.
 104
 105 ### 2024/03/21
 106
 107 * **Alderaan maintenance planned** From about 10am, nodes which become idle will be rebooted to update their CPU power and heat settings further and returned to operation one by one. The rest of the nodes with jobs running on them will be updated later. No jobs will be cancelled.
 108
 109 ### 2024/03/20
 110
 111 * 10:45pm All nodes are currently draining and no jobs can start on them to prepare for maintenance tomorrow.
 112
 113 * Node math-alderaan-c06 is down and being sent for repair.
 114
 115 ### 2024/03/18
 116
 117 * **3pm Alderaan maintenance completed**.  math-alderaan-c[02,14] are still draining and will have their heat envelope reset when they and I become available at the same time. math-alderaan-c06 has a bad memory board and will be down until fixed. All other nodes are available. Please let me know if you note anything odd.
 118
 119 ### 2024/03/15
 120
 121 **Alderaan maintenance in progress**. I am decreasing a maximum generated heat setting (called TDP) on all Alderaan CPUs. This should decrease the switching of
 122 the CPUs to a slower power saving mode when they overheat, and thus result in smoother and more reliable operation within the available cooling capacity.
 123
 124 The change was already done on Alderaan nodes c01 c05 c11 c31 c32, which will keep running normally. Currently, no new jobs can start on any remaining Alderaan nodes, including GPU nodes, but existing jobs are allowed to complete. Alderaan nodes that will have no jobs running on them by Monday 3/18 will be rebooted and have their TDP reset. All Alderaan nodes are expected to be available by the end of the day. No jobs will be cancelled.
 125 If you need to run something urgently between now and Monday and the large number of unavailable nodes is a problem, please let me know.
 126
 127 ### 2024/03/13
 128
 129 * 16 nodes are currently draining. When jobs on them end, no new jobs will start.
 130 The plan is to have at least 10 nodes with no jobs running on them available for
 131 maintenance planned 2024/03/18.
 132
 133 ### 2024/03/11
 134
 135 Singularity containers workshop 11am in SCB 4017.
 136
 137 ### 2024/03/04
 138
 139 * 08:12am Login to math-alderaan.ucdenver.pvt does not work. Investigating. Please use clas-compute.ucdenver.edu to submit and access slurm jobs for now.
 140 * Rebooted, all normal.
 141
 142 ### 2024/02/24
 143
 144 * Added a switch to CPU power saving mode in case of overheating, as an intermediate step before suspending nodes, which should be happening only rarely now.
 145
 146 ### 2024/01/26
 147
 148 * Added an independent monitor with 3 levels of emergency shutdown in case of datacenter overheating: 1. suspend all jobs 2. graceful shutdown of all compute and storage nodes  3. complete poweroff (not reversible remotely)
 149
 150 ### 2024/01/25
 151
 152 * 10:30am All alderaan nodes are powered up. Held jobs released and running.
 153
 154 ### 2024/01/21
 155
 156 * 11:30am The head nodes math-alderaan and clas-compute are up. User files are accessible. Score nodes are available but alderaan nodes are down until further notice.
 157
 158 ### 2024/01/20
 159
 160 * 02:00pm All systems shut down because of datacenter cooling issue.
 161
 162 ### 2023/12/02
 163
 164 We are currently undergoing some important updates and maintenance on our HPC system. Here’s what you need to know:
 165
 166 * **Slurm Reconfiguration:** This is in progress to enhance job isolation and scheduling. Jobs currently running will remain unaffected. However, there might be temporary changes in the behavior when submitting new jobs. Please report any unexpected issues.
 167
 168 * **GPU Nodes Update:** The nodes math-alderaan-h[01-02] are temporarily offline for testing and configuration. During this period, no new jobs will be started on these nodes.
 169
 170 * **Memory Board Replacement:** Nodes math-alderaan-c[29-32] are being prepared for maintenance due to a faulty memory board on math-alderaan-c29. These nodes will be powered down for the installation of a replacement memory board when it is received. New jobs will not start on these nodes until the maintenance is complete.
 171
 172 * **Heat Diagnostics:** Nodes math-alderaan-c[01,05,07,10] are currently offline for heat diagnostics. No jobs can be initiated on these nodes during this period.
 173
 174 * **Access Restrictions:** As a reminder, please avoid SSH access directly to any nodes, especially those that are currently drained or draining.
 175
 176 We appreciate your understanding and cooperation during this maintenance phase. Our goal is to ensure a more robust and efficient HPC environment for all users. Thank you for your patience.
 177
 178 ### 2023/10/19
 179
 180 * 08:30am Head node math-alderaan stuck, rebooted.
 181
 182 ### 2023/09/27
 183
 184 * 08:50pm Head node math-alderaan stuck, rebooted.
 185
 186 ### 2023/09/21
 187
 188 * Workshop: An Introduction to Computing on the Cluster (11AM-12PM in UC Denver SCB 4125)
 189
 190 * All math-alderaan nodes returned to service after the workshop, except 01, 02, 05, and 07, which remain in drain state for heat stress testing.
 191
 192 * Because no CPUs were available to start new jobs, all user accounts were reset to the default maximum of 500 concurrent CPUs. Some CPUs are available now (nodes showing as idle or mixed).
 193
 194 * The maximum concurrent CPU limit can be increased on request, or temporarily without notice if the cluster is underutilized.
 195
 196 ### 2023/09/20
 197
 198 * Several nodes are draining and max concurrent cpus of large array users was reset in preparation for the workshop at 11am.
 199
 200 ### 2023/09/19
 201
 202 * 11:54pm: math-alderaan not accepting user ssh connections, filesystems dropped. Rebooted.
 203
 204 ### 2023/09/10
 205
 206 * 7:43pm math-alderaan not accepting ssh connections
 207
 208 * 11:10pm The math-alderaan head node was stuck. Rebooted remotely.
 209
 210 * When you can't log into math-alderaan, please use the alternative head node clas-compute.ucdenver.pvt. All user files are there and you can submit slurm jobs as usual.
 211
 212 ### 2023/08/08
 213
 214 * MATLAB was upgraded to R2023a with all toolboxes in the CU Denver license installed. This is now the
 215 default on alderaan nodes.
 216
 217 To use the previous installation, type first
 218
 219     module load matlab/R2021b
 220
 221 ### 2023/07/14
 222
 223 * 2:30pm: Alderaan compute nodes automatically shutting down due to high data center temperature, jobs start getting suspended.
 224
 225 * 3:10-4:10: Complete outage of all compute nodes.
 226
 227 * 4:30: Jobs resuming automatically, operations normal.
 228
 229 ### 2023/07/08
 230
 231 * approx 2:00pm: Login issues to math-alderaan and clas-compute head nodes reported.
 232 The file server is down, systems not accessible until further notice.
 233
 234 * approx 8pm: service restored, operations normal.
 235
 236 ### 2023/05/18
 237
 238 * approx 2:30pm: Operations normal
 239
 240 * approx 12pm: Network outage this morning was fixed, but login issues persist.
 241 Please check here or try later. Thanks for your patience!
 242
 243
 244 ### 2023/04/21
 245
 246 * Guide [how to use Personal Globus endpoint](../globus) for data transfer is now available. It was ported from the legacy [wiki](http://ccm.ucdenver.edu), updated, and tested. Globus can transfer large quantities of data (many TB) and work through firewalls.
 247
 248
 249 ### 2023/04/08
 250
 251 * Various nodes are draining for heat testing under load. After the current jobs
 252 on them complete, no new jobs will be able to start on them
 253 until the testing is completed.
 254
 255 * Node math-alderaan-c07 remains unavailable.
 256
 257 ### 2023/04/04
 258
 259 * The data center is too warm for running Alderaan nodes at full CPU load.
 260 Jobs on nodes that are running too hot are getting suspended automatically
 261 until the CPUs cool down, in particular math-alderaan-c[05,07].
 262 See the real-time
 263 [Status](https://demo.openwfm.org/web/alderaan/cpu_temp.txt) for more detail.
 264 The temperature cutoffs were adjusted lower to keep the system from overheating.
 265
 266 * Node math-alderaan-c07 is out for repair. Its temperature rise was too fast and cycling could not keep it at safe temperature.
 267
 268 ### 2023/03/07
 269
 270 * Optimization solver Gurobi with one year site license added to the  /storage/singularity/pyscipopt-geopandas.sif container.
 271
 272 ### 2023/03/06
 273
 274 * Workshop: Introduction to Shell Scripting on Alderaan (12:30-1:30 p.m MST Hybrid)
 275
 276 ### 2023/02/20
 277
 278 * Front end math-alderaan is back online. Operations normal.
 279
 280 ### 2023/02/18
 281
 282 * Front end math-alderaan is down.
 283 The Alderaan cluster is accessible through the alternate front end by
 284
 285      ssh clas-compute.ucdenver.pvt
 286
 287 * Slurm and all compute nodes are working normally.
 288
 289 * Modules and custom software installed in /shared are not available.
 290 Other filesystems are not affected.
 291
 292 * System monitoring is not being updated.
 293
 294 * Note that some project directories and the /scratch directory are in /data001 and /data002 filesystems, which are not accessible from clas-compute head node, colibri cluster, and the score cluster.
 295
 296 ### 2023/02/10
 297
 298 * math-alderaan-h02 is available
 299
 300 ### 2023/02/08
 301
 302 * Taking math-alderaan-h02 down for diagnostics/repair
 303 * math-alderaan-c[01-04] are back
 304
 305 ### 2023/02/03
 306
 307 * Node math-alderaan-h02 in drain state for GPU diagnostics, please do not use
 308 * 10:40pm: Node math-alderaan-h02 available
 309
 310 ### 2023/01/23
 311
 312 * [Hands-on workshop](../training/)
 313
 314 ### 2022/12/07
 315
 316 * Nodes math-alderaan-c[01-04] are with the vendor for repair.
 317
 318 ### 2022/12/06
 319
 320 * Nodes math-alderaan-c[02-04] are draining. They will be powered off tomorrow at 2pm and any jobs on them killed. The chassis with nodes math-alderaan-c[01-04] needs to be sent to the vendor for repairs.
 321
 322 ### 2022/12/04
 323
 324 * Node math-alderaan-c01 still down until further notice
 325
 326 * 1pm sbatch error resolved, operations normal.
 327
 328 * 11pm Users unable to submit slurm jobs, error "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified"
 329
 330 ### 2022/12/03
 331
 332 * 11pm The filesystem mounts on math-alderaan and several compute nodes were found to be dropped. This was causing problems including users not being able to login, access files, submit jobs, as well as issues with jobs already running. The filesystems were mounted again.
 333
 334 * Data center power outage 7pm-9pm
 335
 336 ### 2022/11/23
 337
 338 * Node math-alderaan-c01 is down again. Investigating.
 339
 340 ### 2022/11/14
 341
 342 * Jobs submitted by sbatch or srun may intermittently fail to start properly for some users, possibly due to authentication or network issues. Investigating. Please let me know at jan.mandel@ucdenver.edu if you see this happen.
 343
 344 ### 2022/11/04
 345
 346 * The `/storage/singularity/mixtures.sif` container was updated and several packages added. The old container is at `/storage/singularity/archive/mixtures-nov2-2022.sif`
 347 * All operations are normal.
 348
 349 ### 2022/09/24
 350
 351 * Partition /storage/math/projects is back under 50% utilization. All operations normal.
 352
 353 ### 2022/09/23
 354
 355 * 10am The 40TB /storage/math/projects partition is 100% full. Until this is corrected no one can add any files there. Moving the largest user directories to /scratch and contacting the users individually. The /scratch directory is accessible from all alderaan nodes but not /colibri or /score.
 356
 357 * 4pm /storage/math/projects is at 96%. Please do not put any large new files there, make a directory in /scratch instead.
 358
 359 ### 2022/09/14
 360
 361 *  Node math-alderaan-c01 is back. All operations normal.
 362
 363 ### 2022/09/13
 364
 365 *  Node math-alderaan-c01 is down.
 366
 367 ### 2022/09/12
 368
 369 *  All Colibri nodes are now available. The cause was a power issue in a network switch.
 370
 371 ### 2022/09/05
 372
 373 * 5pm [Large scale network outage at CU Denver](https://stspg.io/cm2wj5ff4k89), clusters not accessible. Last Alderaan updates are from 4:31pm. The cause is a power outage.
 374
 375 * 11:58pm OIT restored the power. All running jobs were killed by the power interruption. All alderaan nodes were reset and are now back. Colibri nodes continue to be not accessible.
 376
 377 ### 2022/09/04
 378
 379 * Node math-alderaan-c01 is down. I'll work on it after the Labor Day weekend.
 380
 381 ### 2022/09/02
 382
 383 * All Colibri compute nodes math-colibri-c[01-24] and also math-colibri-i01 are not accessible. No ETA at this point. Please let me know if you need those nodes urgently. The large memory interactive node math-colibri-i02 works normally.
 384
 385 ### 2022/08/18
 386
 387 * NetCDF C and Fortran libraries rebuilt with the updated Intel compiler 2022.1.0. `module load intel` and `module load netcdf` will automatically select the latest versions. Please do `module purge` to start clean when loading modules and ensure a predictable environment.
 388
 389 ### 2022/08/16
 390
 391 * Software maintenance planned on math-alderaan-h[01-02] is postponed. Please continue using the existing GPU directions in the [Clusters guide](../clusters_guide/#how-to-run-with-gpu-on-alderaan).
 392
 393 ### 2022/08/15
 394
 395 * 4pm Node math-alderaan-c01 is back, all nodes operational.
 396
 397 * Hardware maintenance on math-alderaan-c01 and several other compute nodes from about 1pm. The nodes will be put in drain state in advance. Nodes suspended for CPU overheating will be included and not resumed automatically.  Other Alderaan nodes should not be affected but connectivity may be limited temporarily.
 398
 399 ### 2022/08/11
 400
 401 * Node math-alderaan-c01 is down.
 402
 403 ### 2022/08/08
 404
 405 * Intel BaseKit and HPCKit (compilers, debugger, libraries) updated to current version. Do `module avail` to see what is there.
 406
 407 ### 2022/08/06
 408
 409 * Nodes math-alderaan-h[01-02] are draining to prepare for scheduled maintenance.
 410
 411 ### 2022/08/04
 412
 413 * 5pm: Maintenance completed, operations normal.
 414
 415 * 10am: Maintenance started: continuing to move nodes and cables to improve air flow, and adding fan strips. Nodes math-alderaan-c[01-12] will be powered off. Other nodes and functionality may be affected too.
 416
 417 * 12am: Nodes math-alderaan-c[01-12] are draining, no new jobs can start on them. Existing jobs can continue while the nodes remain up. Any nodes suspended automatically for CPU overheating will remain suspended until the maintenance is completed.
 418
 419 ### 2022/08/02
 420
 421 * Maintenance (rack reconfiguration to improve cooling) is scheduled to continue 8/4 with nodes math-alderaan-c[01-12], which will be down. Other nodes may be affected for shorter periods.
 422
 423 * 8pm: Planned hardware maintenance completed. All alderaan nodes work normally. Please let me know if you see any issues.
 424
 425 * 9am: Planned maintenance in progress. Alderaan not available.
 426
 427 ### 2022/07/31
 428
 429 * Nodes math-alderaan-c[29-32] fixed, math-alderaan-c[13-26,28] still offline.
 430   Slurm and temperature monitoring work with all available nodes normally,
 431   and the nodes can be used at full load.
 432
 433 ### 2022/07/30
 434
 435 * Nodes math-alderaan-c[13-26] and math-alderaan-c[28-32] are offline.
 436   The slurm scheduler works with the remaining nodes normally.
 437   Temperature monitoring works normally.
 438
 439 * Maintenance is scheduled to continue Tuesday 8/2/2022.
 440
 441 ### 2022/07/29
 442
 443 * Maintenance of nodes math-alderaan-c[13-32] to improve cooling is planned for Friday 7/29 9am-3pm.
 444
 445 * Please continue to run jobs, just know that they may be interrupted for maintenance. The downtime of individual nodes will be kept to the minimum possible.
 446
 447 ### 2022/07/28
 448
 449 * In preparation for maintenance on 7/29, nodes math-alderaan-c[13-32] are draining. The nodes will be resumed one by one as soon as possible.
 450
 451 * Thermal management was modified temporarily so that it does not resume suspended nodes (and jobs on them) automatically. Since the ambient temperature is low enough for nothing to get suspended, this is not expected to make a difference.
 452
 453 ### 2022/07/27
 454
 455 * Reconfiguration of slurm to recognize GPUs as a resource in progress. Please let me know should you see any unusual behavior.
 456
 457 * The data center temperature is lower now. Jobs should no longer be suspended because of temperature, or only rarely.
 458
 459 * The cause of the downtime of math-alderaan-c[29-32] was found and it should be corrected by the end of the day Friday 7/29.
 460
 461 ### 2022/07/21
 462
 463 * Node math-alderaan-c01 is back online. Nodes math-alderaan-c[29-32] are down, investigating. No jobs were cancelled.
 464
 465 * The [TDP](https://community.amd.com/t5/processors/what-do-amd-mean-by-tdp/td-p/221727) on math-alderaan-c01 and math-alderaan-c07 was changed.
 466 Their availability will be limited until testing is completed.
 467
 468 ### 2022/07/18
 469
 470 * Node math-alderaan-c01 is down.
 471
 472 ### 2022/07/15
 473
 474 * The high-memory/GPU nodes math-alderaan-h[01-02] are back in operation.
 475
 476 ### 2022/07/14
 477
 478 * The high memory/GPU nodes are down. Investigating.
 479
 480 * One way to avoid getting jobs suspended is to use fewer cores per node. Since the CPU turbo
 481 boost feature will speed up the remaining cores and the load depends on application,
 482 the number of cores per node to use is best determined by trial and error.
 483
 484 ### 2022/07/12
 485
 486 * To protect the computer hardware, jobs running on CPUs which get too hot are
 487 suspended automatically. The jobs resume after the temperature drops.
 488
 489 * We also increase the speed of
 490 the node fans proactively as the CPU temperatures increase.
 491
 492
 493 ### 2022/07/09
 494
 495 * Cooling and temperature monitoring were improved. **All Alderaan nodes can be used at 100%
 496 load safely.**
 497
 498 * Should a [CPU temperature](https://demo.openwfm.org/web/alderaan/cpu_temp.txt)  exceed
 499 a limit, the jobs using the CPU will be suspended automatically and  can be resumed later
 500 after a review of the situation. The node state will show as `drng` in the
 501 [partitions list](https://demo.openwfm.org/web/alderaan/sinfo.txt).
 502
 503 [//]: # (Reducing  the number of cores used has some effect on the CPU heat generated, but only a limited one because the remaining cores can boost their speed up.)
 504
 505 * A link to real-time CPU temperature on all Alderaan nodes was added above.
 506
 507 ### 2022/07/08
 508
 509 * 1:15pm: Normal operations resumed.
 510
 511 * 12:30pm: A/C offline, operations suspended.
 512
 513 * 11:00am: Data center cooling was improved. Nodes math-alderaan-c[01-32] resumed. Please do go ahead and submit your jobs and use all nodes at 100% again.
 514
 515 ### 2022/07/07
 516
 517 * Because of CPU overheating, no new jobs can start on nodes math-alderaan-c[01-32] and existing jobs on nodes loaded more than 80% were killed or suspended.
 518 Arrangements to use the nodes at reduced load are possible while the heat situation is being resolved, please contact jan.mandel@ucdenver.edu.
 519
 520 * Node math-alderaan-c01 reset and returned to operations.
 521
 522 ### 2022/07/03
 523
 524 * Node math-alderaan-c01 failed and won't power on.
 525
 526 ### 2022/06/21
 527 * Thanks to all who submitted their contributions for the annual report for the [NSF grant](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2019089) Alderaan is funded from!
 528
 529 * New modules available on Alderaan: `module load intel` will set up the Intel compilers and MPI; `module load netcdf` will point the environment variable NETCDF to both C and Fortran NetCDF as expected by many software packages. Separate modules netcdf-c and netcdf-fortran are also available. All NetCDF modules are built with the Intel compilers.
 530
 531 ### 2022/06/14
 532
 533 * Slurm configuration with GPUs and memory as controlled resources is coming soon. In the meantime, **please do not request an entire high memory/GPU node if you do not need all the resources, request only the cores you need**.
 534
 535 * 1pm: Maintenance completed. Nodes math-alderaan-h01 and math-alderaan-h02 have two GPUs each now. Operations normal.
 536
 537 * 11am: Maintenance started, taking node math-alderaan-h01 offline.
 538
 539 ### 2022/06/13
 540
 541 * The maintenance on math-alderaan-h01 and the return of math-alderaan-h02 are postponed to tomorrow 6/14 11am.
 542
 543 ### 2022/06/12
 544
 545 * All jobs will now be suspended automatically when the Alderaan inlet temperature reaches 29 C to help prevent a data center overheating emergency.
 546 Normal operations will resume when the temperature returns to at most 25 C. Please check the temperature log above if your jobs are suspended or submitted jobs do not start.
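
The two thresholds form a hysteresis band: jobs are suspended at 29 C and stay suspended until the temperature is back at 25 C or below, so the system does not flip back and forth around a single cutoff. A minimal sketch of that rule follows. The function names and the sample readings are made up for illustration; this is not the actual monitoring script.

```python
SUSPEND_AT_C = 29.0  # from the announcement: suspend when inlet reaches 29 C
RESUME_AT_C = 25.0   # from the announcement: resume at 25 C or below


def next_state(suspended, inlet_temp_c):
    """Return True if jobs should be suspended after this temperature reading."""
    if not suspended and inlet_temp_c >= SUSPEND_AT_C:
        return True   # too warm: suspend all jobs
    if suspended and inlet_temp_c <= RESUME_AT_C:
        return False  # cooled down far enough: resume
    return suspended  # inside the band: keep the current state


def replay(readings):
    """Apply the rule to a sequence of readings, starting from normal operation."""
    suspended = False
    history = []
    for temp in readings:
        suspended = next_state(suspended, temp)
        history.append(suspended)
    return history


if __name__ == "__main__":
    # Hypothetical inlet temperatures in C, for illustration only.
    print(replay([24, 28, 29, 27, 26, 25, 28]))
```

In this sketch, a reading of 27 C keeps jobs suspended if they were already suspended, but does not suspend jobs that are running.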
 547
 548 ### 2022/06/11
 549
 550 * 9:40pm Temperature back to normal 25C, all jobs resumed, normal operations resumed.
 551
 552 * 5pm Datacenter temperature 30C. All alderaan jobs suspended and no new jobs can start to help prevent overheating.
 553
 554 ### 2022/06/10
 555
 556 * Node math-alderaan-h01 will be powered off Monday 6/13 afternoon to add a second GPU. All running jobs will be killed. The node will be put in draining state in
 557 advance so that no new jobs can start. Node math-alderaan-h02 will be put back, upgraded to two GPUs. Other nodes should not be affected.
 558
 559 * SLURM reconfiguration to also allocate GPUs and memory, at least in the math-alderaan-gpu partition, is coming soon.
 560
 561 * Forwarded from XSEDE: Texas A&M University **FASTER** (Fostering Accelerated Scientific Transformations, Education, and Research) is a novel composable high-performance data-analysis and computing instrument funded by the NSF MRI program. **FASTER** adopts the innovative Liqid composable software-hardware approach combined with cutting-edge technologies such as Intel Ice Lake CPUs, NVIDIA A100/A40/A30/A10/T4 GPUs, NVMe based storage, and high-speed Infiniband HDR interconnect. **FASTER** is a 184-node cluster built by Dell and has 40 A100, 200 T4, 8 A40, 8 A10, and 4 A30 GPUs. Each compute node can compose multiple GPUs of various types via Liqid PCIe fabrics. The **FASTER** platform removes significant bottlenecks in research computing by leveraging composable technology that can dynamically integrate disaggregated GPUs to a single node, allowing HPC/AI workflows to flexibly choose the type and number of GPUs to fit their needs. Thirty percent of **FASTER’s** computing resources are allocated to researchers nationwide by XSEDE/ACCESS program. **FASTER** is open as friendly user mode to XSEDE Startup allocations now and invites researchers who are interested in becoming **FASTER** users to submit allocation requests.  More details about **FASTER** can be found: [https://portal.xsede.org/tamu-faster](https://portal.xsede.org/tamu-faster)
 562
 563
 564 ### 2022/06/03
 565
 566 * NEW: Real-time system status added to this updates page, check out the links above.
 567
 568 ### 2022/06/02
 569
 570 * 2pm: Power redistribution and testing completed without tripping any breakers. Normal operations resumed. All existing jobs continued normally. All nodes available except math-alderaan-h02 out for maintenance.
 571
 572 * 11am: All nodes draining. Power reconfiguration and testing are scheduled to start at 1pm. Existing jobs should be able to continue unless a power load test trips circuit breakers.
 573
 574 ### 2022/05/28
 575
 576 * All clusters operate normally. All nodes showing as available can be used at full load.
 577
 578 ### 2022/05/27
 579
 580 * Work on power distribution was completed for the day about 1pm. No jobs were cancelled. Nodes math-alderaan-c[01,05,09,13,17,21,25,29] are offline to reduce the maximum load and avoid potential shutdown over the weekend.  The rest operates normally.
 581
 582 * Node math-alderaan-c18 had a memory board replaced and it is back online. Node math-alderaan-h02 is out for maintenance until further notice.
 583
 584 ### 2022/05/26
 585
 586 * Work on power distribution and stress testing was completed about 2pm for today and all clusters are available.
 587
 588 * Nodes math-alderaan-c18 and math-alderaan-h02 are down for repair until further notice.
 589
 590 * Alderaan will be down 2022/05/27 from about 10:30am to continue work on power distribution. The clas-compute front end, the score cluster, and the colibri cluster should not be affected.
 591
 592 * Announcements switched from emails to this page, announced in login message on front ends.
 593