clusters Archives - Microway
https://www.microway.com/tag/clusters/

Power and Cooling (Multiple-Computer Installations)
https://www.microway.com/knowledge-center-articles/power-and-cooling-multiple-computer-installations/
Tue, 30 Jul 2013 21:51:12 +0000

The post Power and Cooling (Multiple-Computer Installations) appeared first on Microway.

This article applies to groups of computers which consume 1,000+ Watts of electricity. Review your quote or contact us to determine how much your systems will require (the power and cooling numbers are typically listed at the end of the quote – below the total cost of the system).

Acoustic Considerations – Fan Noise

Microway’s server systems provide intelligent cooling fans which ramp up and down with system load. However, most systems are very high-density with more than 1,000 Watts consumed per rack unit. Cooling such systems requires high-speed fans which generate significant noise. Most cluster systems will generate 60 dB to 80 dB of noise. This noise level may be subject to safety regulations (e.g., OSHA) and will be audible from other offices in the building. We recommend placing your cluster in a location designed for servers.

If noise is a concern, many of Microway’s workstations – most notably the WhisperStation – are designed to be very quiet and comfortable for an office environment. Quiet HPC Clusters built from Microway WhisperStations are available. If you have any concerns, discuss these matters with your salesperson. In general, if it’s not a WhisperStation you should consider whether the system is appropriate for a lab or office environment.

Power

Microway servers feature auto-switching power supplies which accept both 120V and 208V power, so there are a variety of options when powering multiple computers. The most common electrical circuits are pictured below. Your salesperson can provide a customized recommendation including the rackmount cabinet, power distribution units and optional UPS backup power.

In general, 208V is recommended over 120V. Using 208V allows more systems to run on a single circuit, and each system runs several percent more efficiently.

[Figure: diagram of common high-power electrical receptacles]

120V 20A NEMA 5-20 and L5-20 Electrical Outlets

After a 20% de-rating for safety, these circuits supply up to 1,920 Watts of power.

120V 30A NEMA L5-30 Electrical Outlets

After a 20% de-rating for safety, these circuits supply up to 2,880 Watts of power.

208V 20A NEMA L6-20 Electrical Outlets

After a 20% de-rating for safety, these circuits supply up to 3,328 Watts of power.

208V 30A NEMA L6-30 Electrical Outlets

After a 20% de-rating for safety, these circuits supply up to 4,992 Watts of power.

3-Phase 208V NEMA L21-20 Electrical Outlets

After safety de-rating, these circuits supply up to 5,700 Watts of power.

3-Phase 208V CS 8365 Electrical Outlets

After safety de-rating, these circuits supply up to 14,400 Watts of power. Depending upon which PDU is selected, the actual load on each PDU may need to be less – 10kW, 12.6kW or 14.4kW.
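
The wattage figures listed for each outlet type follow from simple arithmetic, sketched here in shell. The function names are illustrative. The 80% factor is the 20% de-rating stated above. The three-phase formula (the square root of 3, times line voltage, times amps) and the 50A rating assumed for the CS 8365 are common values that this article does not state, so confirm them with your electrician.

```shell
#!/bin/sh
# Usable watts on a single-phase circuit after a 20% safety de-rating.
# Usage: derated_watts VOLTS AMPS
derated_watts() {
    echo $(( $1 * $2 * 80 / 100 ))
}

# Three-phase circuits (assumption: balanced load, line-to-line voltage).
# Usage: derated_watts_3ph VOLTS AMPS
derated_watts_3ph() {
    awk -v v="$1" -v a="$2" 'BEGIN { printf "%d\n", sqrt(3) * v * a * 0.8 }'
}

derated_watts 120 20       # NEMA 5-20 / L5-20: prints 1920
derated_watts 208 30       # NEMA L6-30: prints 4992
derated_watts_3ph 208 20   # NEMA L21-20: slightly above the 5,700 W listed
derated_watts_3ph 208 50   # CS 8365 (assumed 50A): about 14.4 kW
```

The three-phase results come out slightly above the rounded figures listed in this article.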

Before assuming a particular circuit layout will be sufficient for your new equipment, review:

  • Is other equipment already connected to the electrical circuit? It is common for multiple outlets to connect to the same circuit, so don’t assume that an empty outlet means power is available. You will need to perform an inventory of the power required by your existing equipment.
  • Particularly for smaller circuits: will you be able to evenly split your servers across multiple circuits? Three 1,000W systems cannot be split across two 1,500W circuits, because one of the circuits would have to carry 2,000W. Determine how much electricians will charge for installation of multiple circuits, because it is likely that a single large circuit will be a better choice.
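
The circuit-splitting question can be checked with a short script before calling an electrician. This is a minimal sketch: the function name `fits_on_circuits` is hypothetical, and it uses a simple first-fit placement (each system goes on the first circuit with room), so list the largest systems first and treat a "does not fit" answer as a prompt to review the layout by hand.

```shell
#!/bin/sh
# Check whether a set of systems can be spread across identical circuits.
# Usage: fits_on_circuits CAPACITY_WATTS NUM_CIRCUITS SYSTEM_WATTS...
# Returns 0 if every system was placed, 1 otherwise.
fits_on_circuits() {
    capacity=$1
    count=$2
    shift 2
    printf '%s\n' "$@" | awk -v cap="$capacity" -v n="$count" '
        {
            placed = 0
            # First fit: use the first circuit that still has room
            for (i = 1; i <= n; i++) {
                if (load[i] + $1 <= cap) {
                    load[i] += $1
                    placed = 1
                    break
                }
            }
            if (!placed) { failed = 1; exit }
        }
        END { exit failed + 0 }'
}

# Three 1,000W systems on two 1,500W circuits: prints "does not fit"
if fits_on_circuits 1500 2 1000 1000 1000; then echo "fits"; else echo "does not fit"; fi

# The same systems on one 208V 20A circuit (3,328W usable): prints "fits"
if fits_on_circuits 3328 1 1000 1000 1000; then echo "fits"; else echo "does not fit"; fi
```

Pass the de-rated capacity of the circuit, not its nameplate rating.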

Cooling

Groups of computers typically require special cooling arrangements – your building air-conditioning is designed for offices and will not be able to keep up with the load of compute servers. Note that some facilities reduce or shut down air-conditioning during holidays and weekends. Systems will overheat if they are run in a closed room without sufficient cooling.

A server room or datacenter is the best location for your systems. Microway servers and clusters are designed for installation in industry-standard rackmount cabinets, so you should have no concern when using your own cabinets. We can also provide cabinets with network and power cables pre-wired.

Many facilities specify a maximum power load per rackmount cabinet. You may be restricted to only 7kW or 10kW per cabinet, which would result in racks which are only half-filled. To be certain, present the cooling requirements (provided by your salesperson) to your facilities manager. Your facilities personnel will have details on the cooling load limitations of the datacenter/server room.

Common Maintenance Tasks (Clusters)
https://www.microway.com/knowledge-center-articles/common-maintenance-tasks-clusters/
Tue, 30 Jul 2013 21:08:03 +0000

The post Common Maintenance Tasks (Clusters) appeared first on Microway.

The following items should be completed to maintain the health of your Linux cluster. For servers and workstations, please see Common Maintenance Tasks (Workstations and Servers).

Back up irreplaceable data

Remember that RAID is not a replacement for backups. If your system is stolen, hacked or destroyed by fire, your data will be gone forever. Automate this task or you will forget.

Compute clusters are built from a large group of computers, so there are many different places for data to hide. Make users aware of your backup policies and be certain they aren’t storing vital data on the compute nodes. Let them know which areas are scratch space (for temporary files) and which areas are regularly backed up and designed for user data.

Strongly consider keeping a backup image of the entire head node installation (including a copy of the compute node software image). Bare-metal recovery software is available if you’re not certain how to do this yourself.

As for the user data:

  • For many groups, a weekly or monthly cron job is fine. Write a script calling rsync or tar which writes the files to a separate server, NAS or SAN. Place the script in /etc/cron.weekly/ or /etc/cron.monthly/
  • Users with more complex requirements should look at AMANDA or Bacula
  • Tape backup systems are still available for those who prefer them. Contact us.
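
As a starting point for the cron approach, a weekly script might look like the sketch below. It uses tar; rsync would serve equally well. The function name `backup_tree` and the paths `/home` and `/mnt/nas` are assumptions for illustration, so substitute the locations used at your site.

```shell
#!/bin/sh
# Archive one directory tree into a dated tarball on backup storage.
# Usage: backup_tree SOURCE_DIR DEST_DIR   (prints the archive path on success)
backup_tree() {
    src=$1
    dest_dir=$2
    archive="$dest_dir/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    # -C switches into the source first so the archive holds relative paths
    tar -czf "$archive" -C "$src" . && echo "$archive"
}

# Hypothetical layout: user data in /home, a NAS mounted at /mnt/nas.
# Skip the run if the NAS is not mounted, so nothing lands on the local disk.
if mountpoint -q /mnt/nas; then
    backup_tree /home /mnt/nas/backups
fi
```

Save the script in /etc/cron.weekly/ and mark it executable. On Debian-style systems, give it a name without a dot, since run-parts skips such files.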

Verify the health of your Storage

Drive sectors can go bad silently. Scheduling regular verify passes will catch failing sectors before they cause data loss. Automate them or you will forget.

  • Linux Software RAID (mdadm) arrays can be easily kicked into verify mode. Many distributions (Red Hat, CentOS, Ubuntu) come with their own utilities. To manually start a verify, run this line for each RAID array as root, replacing md# with the array’s device name (for example, md0):
    echo check > /sys/block/md#/md/sync_action
    Monitor the text file /proc/mdstat and the output of dmesg to follow the progress of each verify.
  • Hardware RAID controllers provide their own methods for automated verifies and alert notification. Reference the controller’s manual.
  • Enterprise and parallel storage systems typically provide their own management interfaces (separate from your cluster management software). Familiarize yourself with these interfaces and enable e-mail alerts.
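
For Linux Software RAID, the manual verify command can be wrapped in a loop and scheduled from cron. The following is a minimal sketch: the function name `start_md_checks` is hypothetical, and the optional directory argument exists only so the function can be tried against a test directory instead of the real /sys/block.

```shell
#!/bin/sh
# Start a verify on every idle md array; prints the arrays that were started.
# Usage: start_md_checks [SYSFS_BLOCK_DIR]   (defaults to /sys/block)
start_md_checks() {
    root=${1:-/sys/block}
    for action in "$root"/md*/md/sync_action; do
        # Skip when there are no arrays or the file is not writable (not root)
        [ -w "$action" ] || continue
        # Only start a verify if the array is not already syncing or checking
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
            echo "started verify: $action"
        fi
    done
}

start_md_checks
```

Once a verify completes, the mismatch_cnt file in the same md directory reports how many inconsistent sectors were found.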

Monitor system alarms and system health

If Microway provided you with a preconfigured cluster, then we performed the software integration before the cluster arrived at your site. The cluster can monitor its own health (via MCMS™ or Bright Cluster Manager), but you should familiarize yourself with the user interface and double-check that e-mail alerts are being sent to the correct e-mail address.

Each system in the cluster also supports traditional monitoring and management features:

  • Preferred: learn how to use the IPMI capability for remote monitoring and management. You’ll spend a lot less time trekking to the datacenter.
  • Alternative: listen for system alarms and check for warning LEDs.
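
As an illustration of remote monitoring over IPMI, the sketch below reads the hardware event log of several nodes with the ipmitool utility. The function name `check_nodes`, the user name and the BMC hostnames are hypothetical. The password is taken from the IPMI_PASSWORD environment variable through ipmitool's -E option, so it never appears on the command line.

```shell
#!/bin/sh
# Print the hardware event log of each node over the network.
# Assumptions: hypothetical BMC hostnames and user name; the password is read
# from the IPMI_PASSWORD environment variable (ipmitool's -E option).
IPMITOOL=${IPMITOOL:-ipmitool}

# Usage: check_nodes USER BMC_HOST...
check_nodes() {
    user=$1
    shift
    for host in "$@"; do
        echo "== $host =="
        "$IPMITOOL" -I lanplus -H "$host" -U "$user" -E sel elist ||
            echo "could not query $host" >&2
    done
}

# Example call (commented out; substitute your own BMC addresses):
# check_nodes admin node01-ipmi node02-ipmi
```

Run from cron with the output mailed to an administrator, a loop like this gives a basic fallback if the cluster manager's own alerts are misconfigured.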

Don’t ignore alarms! If you put it off, you’ll soon find that something else is wrong and your cluster needs to be rebuilt from scratch.

Schedule and Test System Software Updates

Although modern Linux distributions have made it very easy to keep software packages up-to-date, there are some pitfalls an administrator might encounter when updating software on a compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so the updater may unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration – particularly for systems with large/parallel storage, InfiniBand and/or GPU compute processor components. These types of systems usually require that kernel modules or other packages be recompiled against the new kernel. Test updates on a single system before making such changes on the entire cluster!
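
One quick check on the test system is whether the out-of-tree kernel modules exist for the newly installed kernel before rebooting into it. This is a rough sketch: the function name `missing_modules` and the module names are illustrative, and it searches by file name only, so a module whose file name differs from its module name (hyphens versus underscores, for instance) would be reported as missing.

```shell
#!/bin/sh
# List required kernel modules that have no build for a given kernel version.
# Usage: missing_modules KERNEL_VERSION MODULE...
# Set MODROOT to search somewhere other than /lib/modules.
missing_modules() {
    kver=$1
    shift
    for mod in "$@"; do
        # Matches plain and compressed module files (.ko, .ko.xz, ...)
        if [ -z "$(find "${MODROOT:-/lib/modules}/$kver" -name "$mod.ko*" 2>/dev/null)" ]; then
            echo "$mod"
        fi
    done
}

# Example: check the running kernel. Module names are illustrative only.
missing_modules "$(uname -r)" nvidia ib_core
```

To vet an update, pass the version string of the new kernel instead of `uname -r`. Any module printed needs to be rebuilt against that kernel before the node is rebooted.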

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.
