docs/updates.md

   1 # News and Status Updates
   2
   3 Please contact jan.mandel@ucdenver.edu with any questions.
   4
   5 Reload the page to see the latest information. Your browser may be caching an old version.
   6
   7 Real-time &nbsp; [Alderaan Temperature Log](https://demo.openwfm.org/web/alderaan/temp.txt) &nbsp; &nbsp;
   8  [Status](https://demo.openwfm.org/web/alderaan/cpu_temp.txt) &nbsp; &nbsp;
   9  [CPU Load](https://demo.openwfm.org/web/alderaan/cpu.txt) &nbsp; &nbsp;
  10  [Memory](https://demo.openwfm.org/web/alderaan/mem.txt) &nbsp; &nbsp;
  11  [Swap](https://demo.openwfm.org/web/alderaan/swp.txt) &nbsp; &nbsp;
  12  [Partitions](https://demo.openwfm.org/web/alderaan/sinfo.txt)
  13
  14
  15 **To protect the computer hardware, should Alderaan CPUs get too hot, the jobs running on them are
  16 suspended automatically. This should happen only rarely now. The jobs resume after the temperature drops, which
  17 should not take more than a few minutes. However, to protect the datacenter, jobs will not resume if the datacenter temperature is too high.**
  18 Please see [CPU temperature](https://demo.openwfm.org/web/alderaan/cpu_temp.txt) for details.
  19
  20 ### 2023/09/19
  21
  22 * 11:54pm: math-alderaan not accepting user ssh connections, filesystems dropped. Rebooted.
  23
  24 ### 2023/09/10
  25
  26 * 7:43pm math-alderaan not accepting ssh connections
  27
  28 * 11:10pm The math-alderaan head node was stuck. Rebooted remotely.
  29
  30 * When you can't log in to math-alderaan, please use the alternative head node clas-compute.ucdenver.pvt. All user files are there and you can submit slurm jobs as usual.
  31
  32 ### 2023/08/08
  33
  34 * MATLAB was upgraded to R2023a with all toolboxes in the CU Denver license installed. This is now the
  35 default on alderaan nodes.
  36
  37 To use the previous installation, first type
  38
  39     module load matlab/R2021b
  40
  41 ### 2023/07/14
  42
  43 * 2:30pm: Alderaan compute nodes automatically shutting down due to high data center temperature, jobs start getting suspended.
  44
  45 * 3:10-4:10pm: Complete outage of all compute nodes.
  46
  47 * 4:30pm: Jobs resuming automatically, operations normal.
  48
  49 ### 2023/07/08
  50
  51 * approx 2:00pm: Login issues to math-alderaan and clas-compute head nodes reported.
  52 The file server is down, systems not accessible until further notice.
  53
  54 * approx 8pm: service restored, operations normal.
  55
  56 ### 2023/05/18
  57
  58 * approx 2:30pm: Operations normal
  59
  60 * approx 12pm: Network outage this morning was fixed, but login issues persist.
  61 Please check here or try later. Thanks for your patience!
  62
  63
  64 ### 2023/04/21
  65
  66 * A guide on [how to use a Personal Globus endpoint](../globus) for data transfer is now available. It was ported from the legacy [wiki](http://ccm.ucdenver.edu), updated, and tested. Globus can transfer large quantities of data (many TB) and works through firewalls.
  67
  68
  69 ### 2023/04/08
  70
  71 * Various nodes are draining for heat testing under load. After the current jobs
  72 on them complete, no new jobs will be able to start on them
  73 until the testing is completed.
  74
  75 * Node math-alderaan-c07 remains unavailable.
  76
  77 ### 2023/04/04
  78
  79 * The data center is too warm for running Alderaan nodes at full CPU load.
  80 Jobs on nodes that are running too hot are getting suspended automatically
  81 until the CPUs cool down, in particular math-alderaan-c[05,07].
  82 See the real-time
  83 [Status](https://demo.openwfm.org/web/alderaan/cpu_temp.txt) for more detail.
  84 The temperature cutoffs were adjusted lower to keep the system from overheating.
  85
  86 * Node math-alderaan-c07 is out for repair. Its temperature rise was too fast and cycling could not keep it at a safe temperature.
  87
  88 ### 2023/03/07
  89
  90 * The optimization solver Gurobi, with a one-year site license, was added to the /storage/singularity/pyscipopt-geopandas.sif container.
  91
  92 ### 2023/02/20
  93
  94 * Front end math-alderaan is back online. Operations normal.
  95
  96 ### 2023/02/18
  97
  98 * Front end math-alderaan is down.
  99 The Alderaan cluster is accessible through the alternate front end by
 100
 101      ssh clas-compute.ucdenver.pvt
 102
 103 * Slurm and all compute nodes are working normally.
 104
 105 * Modules and custom software installed in /shared are not available.
 106 Other filesystems are not affected.
 107
 108 * System monitoring is not being updated.
 109
 110 * Note that some project directories and the /scratch directory are in the /data001 and /data002 filesystems, which are not accessible from the clas-compute head node, the colibri cluster, or the score cluster.
 111
 112 ### 2023/02/10
 113
 114 * math-alderaan-h02 is available
 115
 116 ### 2023/02/08
 117
 118 * Taking math-alderaan-h02 down for diagnostics/repair
 119 * math-alderaan-c[01-04] are back
 120
 121 ### 2023/02/03
 122
 123 * Node math-alderaan-h02 in drain state for GPU diagnostics, please do not use
 124 * 10:40pm: Node math-alderaan-h02 available
 125
 126 ### 2023/01/23
 127
 128 * [Hands-on workshop](../training/)
 129
 130 ### 2022/12/07
 131
 132 * Nodes math-alderaan-c[01-04] are with the vendor for repair.
 133
 134 ### 2022/12/06
 135
 136 * Nodes math-alderaan-c[02-04] are draining. They will be powered off tomorrow at 2pm and any jobs on them killed. The chassis with nodes math-alderaan-c[01-04] needs to be sent to the vendor for repairs.
 137
 138 ### 2022/12/04
 139
 140 * Node math-alderaan-c01 still down until further notice
 141
 142 * 1pm sbatch error resolved, operations normal.
 143
 144 * 11pm Users unable to submit slurm jobs, error "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified"
 145
 146 ### 2022/12/03
 147
 148 * 11pm The filesystem mounts on math-alderaan and several compute nodes were found to be dropped. This was causing problems, including users not being able to log in, access files, or submit jobs, as well as issues with jobs already running. The filesystems were mounted again.
 149
 150 * Data center power outage 7pm-9pm
 151
 152 ### 2022/11/23
 153
 154 * Node math-alderaan-c01 is down again. Investigating.
 155
 156 ### 2022/11/14
 157
 158 * Jobs submitted by sbatch or srun may intermittently fail to start properly for some users, possibly due to authentication or network issues. Investigating. Please let me know at jan.mandel@ucdenver.edu if you see this happen.
 159
 160 ### 2022/11/04
 161
 162 * The `/storage/singularity/mixtures.sif` container was updated and several packages were added. The old container is at `/storage/singularity/archive/mixtures-nov2-2022.sif`.
 163 * All operations are normal.
 164
 165 ### 2022/09/24
 166
 167 * The /storage/math/projects partition is back under 50% utilization. All operations normal.
 168
 169 ### 2022/09/23
 170
 171 * 10am The 40TB /storage/math/projects partition is 100% full. Until this is corrected no one can add any files there. Moving the largest user directories to /scratch and contacting the users individually. The /scratch directory is accessible from all alderaan nodes but not /colibri or /score.
 172
 173 * 4pm /storage/math/projects is at 96%. Please do not put any large new files there, make a directory in /scratch instead.
 174
 175 ### 2022/09/14
 176
 177 *  Node math-alderaan-c01 is back. All operations normal.
 178
 179 ### 2022/09/13
 180
 181 *  Node math-alderaan-c01 is down.
 182
 183 ### 2022/09/12
 184
 185 *  All Colibri nodes are now available. The cause was a power issue in a network switch.
 186
 187 ### 2022/09/05
 188
 189 * 5pm [Large scale network outage at CU Denver](https://stspg.io/cm2wj5ff4k89), clusters not accessible. Last Alderaan updates are from 4:31pm. The cause is a power outage.
 190
 191 * 11:58pm OIT restored the power. All running jobs were killed by the power interruption. All alderaan nodes were reset and are now back. Colibri nodes remain inaccessible.
 192
 193 ### 2022/09/04
 194
 195 * Node math-alderaan-c01 is down. I'll work on it after the Labor Day weekend.
 196
 197 ### 2022/09/02
 198
 199 * All Colibri compute nodes math-colibri-c[01-24] and also math-colibri-i01 are not accessible. No ETA at this point. Please let me know if you need those nodes urgently. The large memory interactive node math-colibri-i02 works normally.
 200
 201 ### 2022/08/18
 202
 203 * NetCDF C and Fortran libraries rebuilt with the updated Intel compiler 2022.1.0. `module load intel` and `module load netcdf` will automatically select the latest versions. Please run `module purge` before loading modules to start clean and ensure a predictable environment.
 204
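The module sequence from the 2022/08/18 entry can be wrapped in a small shell function and called at the top of a job script. This is a minimal sketch: the function name `setup_netcdf_env` is hypothetical, and it assumes the `module` command is already defined in the shell where it runs.

```shell
# Hypothetical helper: start from a clean module environment, then load
# the Intel compilers and the NetCDF build that goes with them.
setup_netcdf_env() {
    module purge       || return 1   # drop whatever was loaded before
    module load intel  || return 1   # selects the latest Intel compilers
    module load netcdf || return 1   # selects the latest NetCDF build
    module list                      # show what ended up loaded
}
```

Calling `setup_netcdf_env` stops at the first module command that fails, so a job script can check its return status before compiling or running anything.
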
 205 ### 2022/08/16
 206
 207 * Software maintenance planned on math-alderaan-h[01-02] is postponed. Please continue using the existing GPU directions in the [Clusters guide](../clusters_guide/#how-to-run-with-gpu-on-alderaan).
 208
 209 ### 2022/08/15
 210
 211 * 4pm Node math-alderaan-c01 is back, all nodes operational.
 212
 213 * Hardware maintenance on math-alderaan-c01 and several other compute nodes from about 1pm. The nodes will be put in drain state in advance. Nodes suspended for CPU overheating will be included and not resumed automatically. Other Alderaan nodes should not be affected but connectivity may be limited temporarily.
 214
 215 ### 2022/08/11
 216
 217 * Node math-alderaan-c01 is down.
 218
 219 ### 2022/08/08
 220
 221 * Intel BaseKit and HPCKit (compilers, debugger, libraries) updated to the current version. Run `module avail` to see what is available.
 222
 223 ### 2022/08/06
 224
 225 * Nodes math-alderaan-h[01-02] are draining to prepare for scheduled maintenance.
 226
 227 ### 2022/08/04
 228
 229 * 5pm: Maintenance completed, operations normal.
 230
 231 * 10am: Maintenance started: continuing to move nodes and cables to improve air flow and adding fan strips. Nodes math-alderaan-c[01-12] will be powered off. Other nodes and functionality may be affected too.
 232
 233 * 12am: Nodes math-alderaan-c[01-12] are draining, no new jobs can start on them. Existing jobs can continue while the nodes remain up. Any nodes suspended automatically for CPU overheating will remain suspended until the maintenance is completed.
 234
 235 ### 2022/08/02
 236
 237 * Maintenance (rack reconfiguration to improve cooling) is scheduled to continue 8/4 with nodes math-alderaan-c[01-12], which will be down. Other nodes may be affected for shorter periods.
 238
 239 * 8pm: Planned hardware maintenance completed. All alderaan nodes work normally. Please let me know if you see any issues.
 240
 241 * 9am: Planned maintenance in progress. Alderaan not available.
 242
 243 ### 2022/07/31
 244
 245 * Nodes math-alderaan-c[29-32] fixed, math-alderaan-c[13-26,28] still offline.
 246   Slurm and temperature monitoring work with all available nodes normally,
 247   and the nodes can be used at full load.
 248
 249 ### 2022/07/30
 250
 251 * Nodes math-alderaan-c[13-26] and math-alderaan-c[28-32] are offline.
 252   The slurm scheduler works with the remaining nodes normally.
 253   Temperature monitoring works normally.
 254
 255 * Maintenance is scheduled to continue Tuesday 8/2/2022.
 256
 257 ### 2022/07/29
 258
 259 * Maintenance of nodes math-alderaan-c[13-32] to improve cooling is planned for Friday 7/29 9am-3pm.
 260
 261 * Please continue to run jobs, just know that they may be interrupted for maintenance. The downtime of individual nodes will be kept to the minimum possible.
 262
 263 ### 2022/07/28
 264
 265 * In preparation for maintenance on 7/29, nodes math-alderaan-c[13-32] are draining. The nodes will be resumed one by one as soon as possible.
 266
 267 * Thermal management was modified temporarily so that it does not resume suspended nodes (and jobs on them) automatically. Since the ambient temperature is low enough for nothing to get suspended, this is not expected to make a difference.
 268
 269 ### 2022/07/27
 270
 271 * Reconfiguration of slurm to recognize GPUs as a resource in progress. Please let me know should you see any unusual behavior.
 272
 273 * The data center temperature is lower now. Jobs should no longer be getting suspended because of temperature, or only rarely.
 274
 275 * The cause of the downtime of math-alderaan-c[29-32] was found and it should be corrected by the end of the day Friday 7/29.
 276
 277 ### 2022/07/21
 278
 279 * Node math-alderaan-c01 is back online. Nodes math-alderaan-c[29-32] are down, investigating. No jobs were cancelled.
 280
 281 * The [TDP](https://community.amd.com/t5/processors/what-do-amd-mean-by-tdp/td-p/221727) on math-alderaan-c01 and math-alderaan-c07 was changed.
 282 Their availability will be limited until testing is completed.
 283
 284 ### 2022/07/18
 285
 286 * Node math-alderaan-c01 is down.
 287
 288 ### 2022/07/15
 289
 290 * The high-memory/GPU nodes math-alderaan-h[01-02] are back in operation.
 291
 292 ### 2022/07/14
 293
 294 * The high memory/GPU nodes are down. Investigating.
 295
 296 * One way to avoid getting jobs suspended is to use fewer cores per node. Since the CPU turbo
 297 boost feature will speed up the remaining cores and the load depends on the application,
 298 the number of cores per node to use is best determined by trial and error.
 299
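Limiting the cores used per node, as suggested in the 2022/07/14 entry, can be done with standard Slurm options. The sketch below writes a batch script that asks for 16 tasks on one node. The file name, the value 16, and the program `./my_app` are placeholders chosen for illustration, not recommended settings.

```shell
# Write a hypothetical batch script that uses only part of one node.
cat > fewer_cores.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=fewer-cores
#SBATCH --nodes=1
# Placeholder value: adjust by trial and error for your application.
#SBATCH --ntasks-per-node=16
# Placeholder program name.
srun ./my_app
EOF

# Submit it with:  sbatch fewer_cores.sbatch
```

Because turbo boost speeds up the remaining cores, halving the task count does not halve the heat, so it may take a few runs with different values to find one that no longer gets suspended.
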
 300 ### 2022/07/12
 301
 302 * To protect the computer hardware, jobs running on CPUs which get too hot are
 303 suspended automatically. The jobs resume after the temperature drops.
 304
 305 * We also increase the speed of
 306 the node fans proactively as the CPU temperatures increase.
 307
 308
 309 ### 2022/07/09
 310
 311 * Cooling and temperature monitoring were improved. **All Alderaan nodes can be used at 100%
 312 load safely.**
 313
 314 * Should a [CPU temperature](https://demo.openwfm.org/web/alderaan/cpu_temp.txt) exceed
 315 a limit, the jobs using the CPU will be suspended automatically and can be resumed later
 316 after a review of the situation. The node state will show as `drng` in the
 317 [partitions list](https://demo.openwfm.org/web/alderaan/sinfo.txt).
 318
 319 [//]: # (Reducing  the number of cores used has some effect on the CPU heat generated, but only a limited one because the remaining cores can boost their speed up.)
 320
 321 * A link to real-time CPU temperature on all Alderaan nodes was added above.
 322
 323 ### 2022/07/08
 324
 325 * 1:15pm: Normal operations resumed.
 326
 327 * 12:30pm: A/C offline, operations suspended.
 328
 329 * 11:00am: Data center was improved. Nodes math-alderaan-c[01-32] resumed. Please go ahead and submit your jobs and use all nodes at 100% again.
 330
 331 ### 2022/07/07
 332
 333 * Because of CPU overheating, no new jobs can start on nodes math-alderaan-c[01-32] and existing jobs on nodes loaded more than 80% were killed or suspended.
 334 Arrangements to use the nodes at reduced load are possible while the heat situation is being resolved, please contact jan.mandel@ucdenver.edu.
 335
 336 * Node math-alderaan-c01 reset and returned to operations.
 337
 338 ### 2022/07/03
 339
 340 * Node math-alderaan-c01 failed and won't power on.
 341
 342 ### 2022/06/21
 343 * Thanks to all who submitted their contributions for the annual report for the [NSF grant](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2019089) that funds Alderaan!
 344
 345 * New modules available on Alderaan: `module load intel` will set up the Intel compilers and MPI; `module load netcdf` will point the environment variable `NETCDF` to both C and Fortran NetCDF, as expected by many software packages. Separate modules `netcdf-c` and `netcdf-fortran` are also available. All NetCDF modules are built with the Intel compilers.
 346
 347 ### 2022/06/14
 348
 349 * Slurm configuration with GPUs and memory as controlled resources is coming soon. In the meantime, **please do not request an entire high memory/GPU node if you do not need all the resources, request only the cores you need**.
 350
 351 * 1pm: Maintenance completed. Nodes math-alderaan-h01 and math-alderaan-h02 have two GPUs each now. Operations normal.
 352
 353 * 11am: Maintenance started, taking node math-alderaan-h01 offline.
 354
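For the request in the 2022/06/14 entry to ask only for the cores you need on the high memory/GPU nodes, a submission might look like the following sketch. The partition name `math-alderaan-gpu` is the one mentioned in the 2022/06/10 entry; the task count, time limit, file name, and program name are placeholders.

```shell
# Write a hypothetical batch script that requests a few cores, not a whole node.
cat > gpu_partial.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=math-alderaan-gpu
# Placeholder values: request only what the job actually needs.
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
# Placeholder program name.
srun ./my_gpu_app
EOF

# Submit it with:  sbatch gpu_partial.sbatch
```

The script deliberately leaves out `--exclusive`, so that, depending on the Slurm configuration, the remaining cores on the node can still be used by other jobs.
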
 355 ### 2022/06/13
 356
 357 * The maintenance on math-alderaan-h01 and the return of math-alderaan-h02 are postponed to tomorrow 6/14 11am.
 358
 359 ### 2022/06/12
 360
 361 * All jobs will now be suspended automatically when the alderaan inlet temperature reaches 29 C to help prevent a data center overheating emergency.
 362 Normal operations will resume when the temperature returns to at most 25 C. Please check the temperature log above if your jobs are suspended or submitted jobs do not start.
 363
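The thresholds in the 2022/06/12 entry form a simple hysteresis rule: suspend at 29 C or more, resume at 25 C or less, and otherwise keep the current state. The function below only illustrates that rule and is not the script actually running on Alderaan. The name `next_state` is hypothetical, and temperatures are assumed to be whole degrees Celsius.

```shell
# Illustration only: thresholds are taken from the announcement.
SUSPEND_AT=29   # degrees C: suspend all jobs at or above this inlet temperature
RESUME_AT=25    # degrees C: resume once the temperature is at or below this

# next_state CURRENT_STATE INLET_TEMP_C
# Prints "running" or "suspended".
next_state() {
    local state=$1 temp=$2
    if [ "$temp" -ge "$SUSPEND_AT" ]; then
        echo suspended
    elif [ "$temp" -le "$RESUME_AT" ]; then
        echo running
    else
        echo "$state"   # between the thresholds: keep the current state
    fi
}
```

For example, `next_state suspended 27` prints `suspended`, because 27 C is still above the resume threshold. The gap between the two thresholds keeps jobs from flipping between suspended and running when the temperature hovers near a single cutoff.
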
 364 ### 2022/06/11
 365
 366 * 9:40pm Temperature back to normal 25C, all jobs resumed, normal operations resumed.
 367
 368 * 5pm Datacenter temperature 30C. All alderaan jobs suspended and no new jobs can start to help prevent overheating.
 369
 370 ### 2022/06/10
 371
 372 * Node math-alderaan-h01 will be powered off Monday 6/13 afternoon to add a second GPU. All running jobs will be killed. The node will be put in draining state in
 373 advance so that no new jobs can start. Node math-alderaan-h02 will be put back, upgraded to two GPUs. Other nodes should not be affected.
 374
 375 * SLURM reconfiguration to also allocate GPUs and memory, at least in the math-alderaan-gpu partition, is coming soon.
 376
 377 * Forwarded from XSEDE: Texas A&M University **FASTER** (Fostering Accelerated Scientific Transformations, Education, and Research) is a novel composable high-performance data-analysis and computing instrument funded by the NSF MRI program. **FASTER** adopts the innovative Liqid composable software-hardware approach combined with cutting-edge technologies such as Intel Ice Lake CPUs, NVIDIA A100/A40/A30/A10/T4 GPUs, NVMe based storage, and high-speed Infiniband HDR interconnect. **FASTER** is a 184-node cluster built by Dell and has 40 A100, 200 T4, 8 A40, 8 A10, and 4 A30 GPUs. Each compute node can compose multiple GPUs of various types via Liqid PCIe fabrics. The **FASTER** platform removes significant bottlenecks in research computing by leveraging composable technology that can dynamically integrate disaggregated GPUs to a single node, allowing HPC/AI workflows to flexibly choose the type and number of GPUs to fit their needs. Thirty percent of **FASTER’s** computing resources are allocated to researchers nationwide by the XSEDE/ACCESS program. **FASTER** is now open in friendly user mode to XSEDE Startup allocations and invites researchers who are interested in becoming **FASTER** users to submit allocation requests. More details about **FASTER** can be found at [https://portal.xsede.org/tamu-faster](https://portal.xsede.org/tamu-faster)
 378
 379
 380 ### 2022/06/03
 381
 382 * NEW: Real-time system status added to this updates page, check out the links above.
 383
 384 ### 2022/06/02
 385
 386 * 2pm: Power redistribution and testing completed without tripping any breakers. Normal operations resumed. All existing jobs continued normally. All nodes available except math-alderaan-h02 out for maintenance.
 387
 388 * 11am: All nodes draining. Power reconfiguration and testing are scheduled to start at 1pm. Existing jobs should be able to continue unless a power load test trips circuit breakers.
 389
 390 ### 2022/05/28
 391
 392 * All clusters operate normally. All nodes showing as available can be used at full load.
 393
 394 ### 2022/05/27
 395
 396 * Work on power distribution was completed for the day about 1pm. No jobs were cancelled. Nodes math-alderaan-c[01,05,09,13,17,21,25,29] are offline to reduce the maximum load and avoid a potential shutdown over the weekend. The rest operates normally.
 397
 398 * Node math-alderaan-c18 had a memory board replaced and it is back online. Node math-alderaan-h02 is out for maintenance until further notice.
 399
 400 ### 2022/05/26
 401
 402 * Work on power distribution and stress testing was completed about 2pm for today and all clusters are available.
 403
 404 * Nodes math-alderaan-c18 and math-alderaan-h02 are down for repair until further notice.
 405
 406 * Alderaan will be down 2022/05/27 from about 10:30am to continue work on power distribution. The clas-compute front end, the score cluster, and the colibri cluster should not be affected.
 407
 408 * Announcements switched from emails to this page, announced in login message on front ends.
 409