IEEE

TFCC Newsletter
Vol. 1, No. 1, April, 1999 - Dialogs on TFCC-L

IEEE Computer Society

Editor's Note: We have decided to archive key dialogs that appear on our open Task Force on Cluster Computing listserver (TFCC-L) by placing them in the Dialogs Section of the TFCC Newsletter. The first discussion deals with the definition of a cluster. Of course, the opinions expressed are solely those of the individuals and not of their companies or affiliations.
  1. Dan Hyde, Bucknell University <hyde at bucknell.edu> initiated the discussion by posting the following:

    We currently have 144 members subscribed to TFCC-L. Certainly that is enough critical mass to promote spirited discussion. :^)

    Let me throw out a possibly controversial question to get the discussion on this list moving:

    "How is cluster computing different from parallel computing or distributed computing or collaborative computing?"

  2. TFCC Co-Chair Rajkumar Buyya, Monash University, Melbourne, Australia <rajkumar@dgs.monash.edu.au> responded with:

    Here are my few initial thoughts:

    Clusters have emerged as a low cost solution providing:
    -- high performance
    -- high availability

    Clusters support both parallel and sequential programs, e.g., emerging OSs like Solaris MC and UnixWare.

    Cluster nodes have a strong sense of membership (unlike distributed systems).

    Single Point of Entry: A user can connect to the cluster as a single system (like telnet beowulf.myinstitute.edu), instead of connecting to individual nodes as in the case of distributed systems (like telnet node1.beowulf.myinstitute.edu).

    The above property is exhibited only by CLUSTERS.

    Distributed systems are too loosely connected, use slow interconnects, and their nodes need to be handled explicitly.

    Clusters offer Single System Image.

    Cluster Computing is not JUST parallel computing, it is much more.

    I can write many more points, but let us get the opinions of other members of the list.


  3. Greg Pfister of IBM's Advanced Technology & Architecture, Server Design Division <pfister@us.ibm.com> supplies his definition of a cluster from his book and expands on it.

    I agree that we'll get ourselves fouled up if we have no common definition of cluster, or an inappropriate one. Here's the definition I used in my book ("In Search of Clusters," second edition, published by Prentice Hall, ISBN 0-13-899709-8), p. 72:

    A cluster is a type of parallel or distributed system that consists of a collection of interconnected whole computers used as a single, unified computing resource.
    "Whole computer" is meant to indicate a normal, whole computer system that can be used on its own: processor(s), memory, I/O, OS, software subsystems, applications.

    This seems unsatisfactorily vague to some because it does not give any reason for clusters -- doesn't say why they are used (it avoids a teleological definition). There are two reasons I chose not to do that:

    First, the list is too long to be practical. Clusters have been used to obtain every one of the many possible benefits of parallel (& distributed) systems, and some benefits not usually cited for those. I listed 10 such benefits in Chapter 3 of my book, and could add another two or three today. Not only is the list too long, it keeps expanding. As Raj said, there are "many points" that can be written. Too many.

    Second, for every last item on that long list, there are valid, deployed, useful clusters that do NOT have it. I've seen many fruitless arguments starting "That's not a cluster because it doesn't have X." -- where X is high availability, scaling, fast interconnect, single point of entry, single system image (whatever the reader thinks that means), and so on. The person making the statement usually has a lot of intellectual or real capital bound up in X, and so doesn't want to talk about Y (or divert resources to it). I really don't think we want to be exclusionary in our definition.

    I do think it is a very good idea to have an associated paragraph that lists, non-exclusively, a number of the things clusters are good for. Have to give people some notion of why this is interesting! But that list is not the definition, in my opinion.


  4. Ekechi K. E. Nwokah of Purdue University <nwokah@purdue.edu> expanded on Rajkumar Buyya's comments (See message 2).

    For the purposes of classification, I would lean towards classifying clusters as an alternative method of parallel computing. It is closer to that than anything else, and different from it only in the fact that it's more loosely coupled. In terms of the functions and goals, they are more or less the same.

    Rajkumar Buyya wrote:
    > Here are my few initial thoughts:
    > Clusters have emerged as a low cost solution providing:
    > -- high performance
    > -- high availability

    Yes. Definitely.

    > Clusters support both parallel and sequential programs, e.g., emerging OSs like Solaris MC and UnixWare

    True. A very nice property of clusters. Do people frequently use both parallel and sequential codes interchangeably though? Many of the clusters I've seen have been dedicated to running parallel programs as a cheap alternative to a more expensive machine and don't really run sequential codes at all. Someone can correct me on this since my knowledge about how clusters are actually used around the world is somewhat limited.

    > Cluster Nodes have strong sense of membership (unlike distributed systems):
    > Single Point of Entry: A user can connect to the cluster as a
    > single system (like telnet beowulf.myinstitute.edu), instead of
    > connecting to individual nodes as in the case of distributed systems
    > (like telnet node1.beowulf.myinstitute.edu).
    > The above property is exhibited by only CLUSTERS.
    > Distributed Systems are too loosely connected, use slow interconnects, and nodes need to be handled explicitly.

    I'm not sure that this does not also apply to clusters. "Loosely connected" and "slow interconnect" are relative terms. There are many clusters using good old Ethernet as the interconnect. Furthermore, the standard interconnects in clusters -- Ethernet (Fast/Gigabit), ATM, Myrinet -- are also used for LANs or WANs in distributed or collaborative environments. The interconnects in both distributed systems and clusters are not the same as you would find in a Tera (HIPPI) or some of the custom interconnects used in Crays, etc., so they are distinguishable in that sense. Whether they are distinguishable going from a cluster to a networked space, however, I don't know.

    > Clusters offer Single System Image.

    This is a highly desirable property of clusters, and any cluster which does not have it should. I'm not sure that it's a requirement, though. A cluster does not imply a Beowulf. One may have a cluster of machines that does not offer a single system image (PAPERS) and yet be able to run MIMD code on it quite efficiently. Requiring a group of machines to provide a single system image in order to be considered a cluster is a bit much to ask, I think.

    Also, when speaking of single system images, one must also consider providing a single file system image. Very few clusters provide this (FUFS, GFS, etc.) and as such many clusters do not provide "complete" single system images.

    > Cluster Computing is not JUST parallel computing, it is much more.

    Yes, but isn't parallel computing the ultimate goal? I mean, aren't we all trying to create a parallel machine for less money? The above statement makes me think that the claim is that a cluster is somehow "better" than an MPP, which I wouldn't agree with. Just my own interpretation, or misinterpretation if you will.

    On the other hand, I do agree that the power to run a cluster as individual/autonomous nodes as well as a synchronised MPP provides a measure of capability not found in other systems.


  5. Rawn Shah of Tucson, Arizona <rawn@rtd.com> tries his hand at a definition of cluster.

    Here is yet another attempt at definition. I don't think it contradicts any ideas so far but you be the judge.

    A processing node is a single system that can execute an algorithm or application instructions.

    A "clustered application" or "cluster-aware application" is one that is aware and takes advantage of the features and behavior of a cluster. An application in this case may be a user application, a system application, an operating system, or a hardware subsystem.

    Using the IETF's lingo, the word MUST means that a property is required as part of the definition; without it, a system cannot be considered as adhering to the definition. The word MAY means that the system can have additional properties that do not exclude it from the definition under the "MUST" clause.

    A cluster MUST exhibit the following properties:

    1. A cluster contains more than a single processing node.
    2. Clusters are networked together at a level of indirection from the CPU(s) of each node. This may be through a special channel interface, a system bus, a network I/O port, etc.
    3. Processing nodes within the cluster communicate with each other using defined protocols.
    4. A cluster is a single nameable entity. Although individual nodes may have their own personal names or identities, they can all be referenced as a whole with a single name from outside the cluster.

    A cluster MAY exhibit the following properties:

    0. A cluster may be implemented in hardware, in system software, or in application software.

    1. The cluster may partition the data set in three ways:
    a) all the nodes have equal access to the data set
    b) each node may have its own portion of the overall data set
    c) some nodes may share data sets while others contain subsections of the data sets privately

    2. A cluster may spread the processing of a clustered application:
    a) to a specific node
    b) to several but not all nodes in the cluster (either symmetric or asymmetric)
    c) to all nodes of the cluster (either symmetric or asymmetric)

    The processing or load distribution algorithm may or may not run on a processing node of the cluster.

    3. The output or results of a clustered application can be:
    a) sent from a single processing node within the cluster with its personal identity (not the cluster's identity)
    b) sent from a single processing node within the cluster with the identity of the entire cluster
    c) sent by multiple processing nodes within the cluster, each with its own personal node identity
    d) sent by multiple processing nodes within the cluster with the identity of the entire cluster.

    4. System administration of the cluster can be from:
    a) a single central control point
    b) any number of control points

    The control point may or may not be a processing node of the cluster.

    5. A cluster may or may not assume the guaranteed availability of all nodes within the cluster.


  6. Geoffrey Hardman <geoff.hardman@newbase.com.au> adds more comments.

    Adding a few more fish to the pond....

    - parallel computing is a performance thing that offers multiple data paths for a single process. In other words, all parts of the process are executed at the same instant, in parallel with each other. Generally speaking, such systems only do one task at a time. If one part of the path fails, the entire thing must be re-initialised and the process restarted. This is a performance model, not an availability or transaction-oriented model. Parallel architectures are useful for predictive modelling but not for transaction-based stuff. Examples of this include ccNUMA or the NASA Linux-based Beowulf server.

    - clustering offers the ability to scale by spreading process loads for multiple (repetitive?) processes. In this model the processes are executed sequentially, but multiple such processes may be underway at any given moment, and spread across all nodes. The load can be statically managed or dynamically managed, and the nodes can be discrete systems or one logical entity spread over multiple nodes. The beauty of this model is that it not only offers scalability in terms of the number of processes that can be executed at a given moment, it also offers availability through both its multiple-discrete-node model and, generally, fail-over capabilities that enable partially complete processes to be successfully concluded on a new node should the original node fail. Examples of this include Non-Stop UnixWare, SCO ReliantHA, Microsoft Cluster Server (Wolfpack), Digital TruCluster, Sun's cluster model, and so on.

    - collaborative computing is something different again, and is a 'document'-based way of doing stuff. In this model each 'node' has certain duties, some of which overlap individually with other nodes, and all of which are also shared with at least one other node. As bits of the process are completed the status is 'advertised' to the rest, and the next available node with the suitable 'duty' assignment picks it up and does its bit. This model suits a workflow environment but is way too slow for transaction-based systems, and utterly useless for predictive modelling. An example of this is Lotus Notes. Note that Exchange does not fall into this category, as it really only replicates; tasks are not shared around available servers.


  7. Steve Chalmers of Hewlett-Packard <fsc@core.rose.hp.com> tries a marketplace style definition.

    Perhaps it is time to take a cue from the so-called "cluster" products in the marketplace and pursue not a better definition of "cluster" but rather a problem statement closer to the needs of the vast majority of applications.

    Try, "A modularly expandable pool of system (CPU, RAM, IO) and storage (disk, tape) resources which can be flexibly allocated among applications (services, batch jobs, traditional apps, traditional timesharing, even parallel services), enforcing and protecting resource boundaries between applications, by a single administrator (or cohesive team of administrators)."


  8. Bill Moshier of Troika Networks, Inc. <billm@troikanetworks.com> responds to Steve Chalmers (See Note 7.)

    Good definition. But perhaps a simpler one would be:

    What is a cluster?

    A cluster is a group of servers, workstations, and storage subsystems that are linked together by a SAN so they act as a single, large computing resource.

    What are the benefits of clustering?

    Clustering offers scalability by allowing multiple SHV [Standard High Volume] servers to work together. The cluster combines the processing power of all servers within the cluster to run a single logical application (such as a database server). Furthermore, additional processing power can be easily provided by adding servers to the cluster.

    Clustering offers availability by allowing servers to "back each other up" in the case of failure. When a server within the cluster fails, another server (or servers) picks up the workload of the failed server. To the user, the application that was running on the failed server remains available.

    Clustering offers manageability by providing a "single system image" to the user of the cluster. The user sees the cluster as the provider of services and applications. The user does not know (or care) which server within the cluster is actually providing services.

    This is a definition of clustering from the VI Architecture viewpoint.


  9. Greg Pfister <pfister@us.ibm.com> responds to Bill Moshier (See message 8).

    [Bill] I like your simplification. In fact, it moves the definition close enough to mine that I'm going to "deconstruct" it to show why I like mine better. Note -- if I didn't think it was already pretty good, it would be too much work to do this!

    You started: "A cluster is a group of servers, workstations, and storage subsystems..."

    Not PCs? You've alienated many Linux/Beowulf people, who glibly glue together castoff desktop systems otherwise destined for recycling. Not "thin servers" or "network appliances" or other newly minted concepts? The possibilities are large. That's why I prefer the umbrella term "whole computer," since that covers just about anything anybody is likely to dream up.

    Also, "whole computer" implicitly includes storage subsystems. I don't know that storage is more intrinsic than communication, tapes, scanners, geophone arrays (they actually use large SPs on board oil exploration vessles, connected to those things), whatever.

    You continued: "... that are linked together by a SAN ..."

    So systems without a SAN are disqualified? You've alienated the Linux/Beowulf crowd again. This time you've also tossed out users of today's clusters of HP, Sun, RS/6000, and other systems. Better duck... And even when VIA becomes more common (and I've little doubt it will), there are going to be lots of clusters connected by plain old Ethernet. Or wet string. Or whatever.

    And I'm afraid I'll have to violently object to the use of the acronym SAN. It used to mean "System Area Network," but the Fibre Channel folks have plastered the clueless media with a renaming of FC to "Storage Area Network," so we have effectively lost the ability to use those three letters in that order.

    You concluded: "...so they act as a single, large computing resource."

    Only thing to object to here is "large," on two grounds - absolute, and relative. There are lots of uses for really small clusters in an absolute sense, e.g., 2-node highly-available file servers. In numerical terms (the "ships" that software people like to count) those are probably going to be the most common. In a relative sense, there are also lots of clusters that aspire to no more performance than you get from a single node: The owners just want a full hot-standby system.

    When those changes are made, it turns into my definition: A collection of connected whole computers that are used as a single resource.

    From my prior comments, it can be predicted that I'd agree with separating the benefits of clusters from the definition. However, I find your list a bit short. Here's a non-exhaustive outline list of cluster uses I used recently:

    performance -- throughput
    performance -- turnaround time
    availability -- avoidance of unplanned outages
    availability -- "continuous," meaning avoidance of planned outages
    cost/performance
    administration
    incremental growth
    scavenging of "unused" MIPS on desktops
    separation of incompatible workloads
    server consolidation
    release migration
    mixed production and test environments
    constrained systems
    security -- firewalls
    security -- other reasons to separate workloads based on security / access issues

    I'm sure there are still a few missing. No, make that "many" missing.

    We all have our hobbyhorses. Mine is avoiding others' hobbyhorses for uses of clusters. (Sort of like being a Positivist.)


  10. Bill Moshier responds to Greg Pfister with concerns about "whole computer" (See message 9).

    Greg - You've deconstructed it well! I would fully agree with the points you've made - and I'd be very glad to work within the simple definition you've proposed.

    >>A cluster is a type of parallel or distributed system that consists of
    >>a collection of interconnected whole computers used as a single, unified computing resource.

    The only item I would like to discuss would be 'whole computer'.

    >>"Whole computer" is meant to indicate a normal, whole computer system
    >>that can be used on its own: processor(s), memory, I/O, OS, software subsystems, applications.

    Perhaps something like 'computing platforms' would be more descriptive - for do nodes in a cluster necessarily have to be able to function on their own? They must have processor, memory, I/O, some kind of OS, and applications. This change, however, may make the cluster concept too broad and unworkable.


  11. Greg Lindahl of High Performance Technologies Inc (HPTi) responds to Bill Moshier's use of 'computing platforms' (See message 10).

    I am reminded of a Supreme Court justice who opined that he couldn't define pornography, but he was sure of it when he saw it. I can think of some sub-computers which I consider clusters and some which I don't. Is the T3E a cluster? Each node has its own memory and microkernel. Well, maybe not. But what about the proposed StrongARM based PCI expansion cards? Take a PC, plug in 20-odd cpus, 5 to a card, each running Linux. The host PC provides disk I/O and between-box networking. Is that a cluster?

    Most people would say "no" to the T3E, and "yes" to the second. There's a good reason for it; the T3E microkernel is forwarding lots of system calls verbatim to the OS nodes, so the compute nodes really aren't independent of the OS nodes. But the StrongARM gizmo really has separate machines using the PCI bus like a network. But you'd be hard pressed to write a description of finite length which would apply to this situation and to others that I could dream up.

    If you don't like my StrongARM example, then consider the cluster that Stanford CS is building: 1200 nodes, each in a separate PC case. But they're diskless.


  12. Greg Pfister responds to Bill Moshier about "whole computer" (See message 10).

    Bill - Glad we have managed to achieve at least pairwise agreement.

    About "whole computer": I really think that's necessary. As you suggested, I think the definition otherwise becomes too general. I put "whole computer" in there to distinguish it from other forms of parallelism, like:

    SMP: only thing replicated is the CPU (and cache, etc.)

    SIMD: only thing replicated is the ALU, some registers, and possibly some memory

    NUMA: nearly all replicated, but not the OS.

    Partitioned NUMA, with multiple copies of the OS, is of course a cluster, as is a partitioned SMP. Things like Tandem/SCO single-system-image and Sun's some-day SSI Full Moon actually have multiple copies of the kernel, specifically so that each cluster node can still function if the others fall down. They just communicate enough that you can't tell that they're not one system, at least from some viewpoints. If the Stanford Hive NUMA-based stuff ends up with adequately stand-alone "cells," I'd welcome them to the advantages of having partitioned the NUMA enough to have become a cluster.

    Also, one of the few constants I've found in clustering is that customers -- users -- do it themselves, and the parts they glue together are nearly always quite readily described as "whole computers."

    "Computing platform" -- maybe. I believe I could live with that. But I'm a bothered by what I see as a variable definition for that term. I've heard "platform" often taken to mean hardware + OS. But I've also heard it used for hardware + OS + database + Java Interpreter + transaction monitor + message-queueing system + other miscellaneous middleware, on top of which some benighted (and wealthy, apparently) user expects to construct an application. Is there a firm definition somewhere? It might mean "a collection of h/w & s/w on top of which you build applications." If that's so, and generally accepted, then: Do we want to say the apps themselves are on our cluster nodes, or not? I included them, since I didn't want to define something only usable by software developers.

    On a different axis: "Whole computer" is quite a bit less jargon-y than "computing platform." I like that, but an argument could be made that the likely readership will be more comfortable with jargon.

    Either way, I think we will need a separate sub-definition of whichever term is used.


  13. Tim Mattson of Intel Parallel Algorithms Laboratory <timothy.g.mattson@intel.com> agrees with Greg Pfister on the term "whole computer" (See message 12).

    I know we aren't putting it to a formal vote, but I want to go on record as strenuously agreeing with Greg Pfister and his support for the term "whole computer" in the proposed definition of a cluster. I think it's much clearer.

    I also second his comment:

    ... one of the few constants I've found in clustering is that customers -- users -- do it themselves, and the parts they glue together are nearly always quite readily described as "whole computers."

    This is important enough that in any discussion of the definition of cluster, I would include some version of this comment.


  14. TFCC Co-chair Mark Baker <Mark.Baker@port.ac.uk> asks for status on our definition of cluster.

    I've been following the discussion about the definition of a cluster with great interest...

    I'm in the process of writing a couple of presentations for the IPPS/SPDP conference, where I'm going to talk about the TFCC. There are two things in particular I would like to talk about: our definition of a cluster and the reasons why there should be a TFCC.

    So my first question is: where are we with regard to the definition of a cluster? I want to put together a couple of sentences that define a cluster. I'm fairly happy with Greg's definition...

    A cluster is a type of parallel or distributed system that consists of a collection of interconnected whole computers used as a single, unified computing resource.

    Where "Whole computer" is meant to indicate a normal, whole computer system that can be used on its own: processor(s), memory, I/O, OS, software subsystems, applications.

    I guess from my point of view I am not totally keen on the term "whole" - but the alternatives are no better (platforms/systems)! Does anyone have a better short definition?

    With regard to the rationale for the existence of the TFCC, I have the following general points so far:

    I look forward to your comments on the points made above.


  15. Rawn Shah <rawn@rtd.com> asks Mark Baker for a clarification on Single-Board Computers (SBCs) (See message 14).

    Just as a clarification, I take it that this definition does include Single-Board Computers (SBCs), where they each have their own CPU, memory, I/O controller, and even disks but run within an external backplane that provides power and a separate controller unit to manage the SBC units?

    This sort of falls into the description of the earlier example of the StrongARM-based boards; however, the SBCs are separate computing units, just not physically "whole" computers; the SBC cage usually has other unifying interfaces such as a keyboard-video-mouse switch, power supplies, and fans.


  16. Greg Pfister <pfister@us.ibm.com> responds to Greg Lindahl (See message 11).

    Greg, I think at least part of this depends a lot on what terms happen to be fashionable.

    For example, there was a period during which the RS/6000 SP folks were adamant that they *were* *not* a dorky little cluster. They were (pause for effect) MASSIVELY PARALLEL, which was the cool thing to be at the time.

    Then the winds of Industry opinion shifted. Clusters became cool -- viewed as broadly valuable tools that anybody could deal with; while MASSIVELY PARALLEL started to be viewed as a more niche-y thing you needed to hire 17 Ph.D.'s to run. (Neither statement is completely true, of course. Irrelevant.) So the SP folks started calling it a cluster. This was of course done without the slightest change in system architecture or development priorities. Probably others have similar stories. It's all marketing, whether to customers or to funding agencies. Or to upper management.

    Now, the SP is at least a little more cluster-like than the T3E. Each node does, for example, have a complete OS kernel. Nevertheless, I wouldn't put it beyond the realm of possibility that the reason "most" people would say "no" to the T3E -- say it's not a cluster -- is that the folks you're thinking of, ones who would buy and use T3Es, remain steadfastly unmoved by the flickering vagaries of Industry fashion.

    Or else they just haven't caught on yet. :-) (Well, 1/2 :-).)

    That said, I'd point to something I think is a significant difference between the T3E and the StrongARM expansion cards you mention: scale.

    There are many, many issues -- logical, system organization, programming, and physical (power, packaging, cooling) -- that only arise when you aim at using 100s or 1000s of units of whatever type. Many are quite specific to problems of scaling. There are things you must do when you want to scale that far which can even be counterproductive for smaller systems; scaling down and scaling up have different sets of problems.

    In fact, if I could figure out a way to do it reasonably, I'd limit the definition of "cluster" to systems that aren't "massively scalable," just because the TC on parallel processing has a long history of dealing with scaling issues, and, since I'm personally tired of those issues, I don't think there's a point to that redundancy. But I can't, at least so far, so I won't propose it.

    And yes, this means I've got some major qualms about that 1200 node thing at Stanford.


  17. Greg Pfister <pfister@us.ibm.com> responds to Rawn Shah (See message 15).

    I'd certainly include SBCs.

    I'm also aware of the fuzzing of the meaning of "whole" in this context.

    Were we to get picky about this, I'd probably start muttering about "logical wholeness" and "ability to boot independently" while worrying a lot about "diskless" systems that boot over a commo link or from another system's IO subsystem.

    I'm pretty sure I would not want to include compute-only units that do no IO on their own at all.


  18. Niraj Srivastava, Senior Consultant at CLAM Associates <niraj@clam.com> tries his hand at defining a cluster.

    A cluster is a group of independent nodes interconnected and working together as a single system providing capability, availability, and scalability. Here a node is a "box" consisting of a "CPU" and an OS providing cluster management, interprocessor communication, and I/O services.

    The cluster today is evolving from three distinct lineages, and the definition of a cluster is influenced by the lineage one follows. The challenge in defining the cluster is to include all three roles, today and in the future.

    Clustering for capability (parallel computing): the simultaneous use of more than one CPU to execute a program. Ideally, one decomposes (data and/or task decomposition) a single problem to run on many computing nodes. One may also use many I/O nodes to provide parallel I/O (this requires support for many-to-many mapping - many memories to many I/O devices). A node may be an SMP computer. However, this type of parallel processing requires very sophisticated software called distributed processing software. For example, see: http://www.beowulf.org/ This kind of cluster would challenge the T3E, CM-5, and "SP-2".

    Clustering for availability (reliability): clusters have a long history in this role. Clustering provides increased reliability over non-clustered systems. Should one node in a cluster fail, the fault is isolated to that particular node and its workload is automatically spread over the surviving nodes. Examples of this: IBM's HACMP, HP MC/ServiceGuard, MSCS (Wolfpack), Sun's Solstice HA.

    Clustering for scalability (load balancing, workload management): users are able to expand power and improve performance by adding nodes to the cluster incrementally and balancing the workload among them. Workload management needs workload scheduling, workload analysis, workload monitoring, and administration. For more information see: http://www.platform.com http://techreports.larc.nasa.gov/ltrs/94/tm109025.tex.refer.html This kind of cluster would challenge SMP-class machines, multi-headed Crays, and the SP-2.

    The three functionalities can be viewed as the three apexes of a triangle:

                             CAPABILITY
                                 *
                                *  *
                               *     * 
                              *        *
                             *           *
                            *              *
                           * * * * * * * * * *
                   SCALABILITY           AVAILABILITY
    
    Clusters in the past have not strayed too far from the apexes, and if they have moved, it is only along the sides -- until recently. One example is Oracle Parallel Server (OPS). OPS uses the cluster for capability, availability, and scalability.

    So why did I replace "whole computer" with "node"? I shall explain this with a historical perspective. The definition of a node is ever changing with changes in technology. Yesterday a node was a processing element, today it is a "whole computer," tomorrow it might be an appliance with personality.

    In the late 80's and early 90's, when I was evangelizing parallel computing for Thinking Machines from the pulpit of Danny Hillis and Guy Steele, we called the bit processors of the CM-2 the nodes. Later, on the CM-5, Sun's SPARC processors became the center of the node, but there were many kinds of nodes: processing nodes, I/O nodes, control nodes. They were all slightly different based on specialization but were nonetheless nodes interconnected to perform capability computing. The idea was to use commodity parts and leverage Moore's Law to create supercomputers. The networks were still proprietary. It became cheaper to use the pizza box rather than open the box up, and in comes the SP-2; thus the node becomes a "whole computer." In the meantime, commodity networks improved from Ethernet to FDDI, the DEC GIGAswitch, ATM, HIPPI, FCS, SCI... And voila, we are talking of clusters as supercomputers.

    What we need is the glue that allows a cluster to look like a single system -- or, in Larry Smarr's term, a metacomputer -- doing capability, scalability, and availability as required by the application.


  19. Greg Lindahl <lindahl@cs.virginia.edu> responds to Greg Pfister (See message 16).

    > I think at least part of this depends a lot on what terms happen to be fashionable.

    >For example, there was a period during which the RS/6000 SP folks were
    >adamant that they *were* *not* a dorky little cluster. They were (pause for
    >effect) MASSIVELY PARALLEL, which was the cool thing to be at the time.

    That's right. I know SP users who are *still* adamant that they are not a (dorky) cluster.

    I don't have a magic definition which is better; I'd say "whole computer" is the best one yet. But we should realize that our definition won't be comprehensive, and we're going to get overlap with other TCs. (I can't imagine how a good definition is going to prevent overlap!)

    >In fact, if I could figure out a way to do it reasonably, I'd limit the
    >definition of "cluster" to systems that aren't "massively scalable," just
    >because the TC on parallel processing has a long history of dealing with
    >scaling issues, and, since I'm personally tired of those issues, I don't
    >think there's a point to that redundancy. But I can't, at least so far, so
    >I won't propose it.

    Now them's fightin' words! Yes, there are a lot of scaling issues that the TC on parallel processing has dealt with. But they usually don't do it with commodity technology.


  20. Steve Chalmers <fsc@core.rose.hp.com> responds to Greg Pfister (See message 17).

    >I'm pretty sure I would not want to include compute-only units that do no IO on their own at all.

    But I would include a unit with processor, memory, and at least one "Future I/O" port, even though the I/O cards it uses are located in a SAN-IO fabric -- and shared between units -- rather than being part of the unit. This circa-2001 hardware design is remarkably similar to those you quite correctly exclude today.

    I suspect this means the definition needs to be one of drivers running in the unit rather than one of hardware included in the unit.


  21. Greg Pfister <pfister@us.ibm.com> responds to Steve Chalmers (See message 20).

    Oh, bother. Forget circa-2001. Tandem Himalaya systems started looking like that 5-7 or so years ago. It's what they invented the term SAN for (that's System ...).

    I even pointed them specifically out in the 2nd edition of my book, saying that they were like nothing else on the planet -- following which, naturally, a colleague from S/390-land proceeded to argue with me that this is also what S/390 mainframes have looked like for quite a while, with things like "channel directors" where "switches" live in SAN diagrams.

    I guess just because I wrote it doesn't mean I read it.

    In a minute I'm going to be reduced to "I don't know how to define 'whole computer,' but I know one when I see it."


  22. Rawn Shah <rawn@rtd.com> raises another special case to consider.

    Another special case which may need consideration:

    The Sun Starfire/Ultra 10000 allows multiple 'domains', each running its own copy of Solaris on a given set of processors. Each of the processor boards has its own RAM, up to 4 processors, and an SBus or PCI peripheral bus. Each processor board can have its own disk system through SBus or PCI boards as well as network interfaces. However, typically, the system uses a separate crossbar switch to a separate peripheral device system which connects a number of drives. The machines can be clustered together independently, as each domain of system boards works as an independent machine.

    This is sort of a single large machine sometimes, and at other times a set of independent machines within a single cage, depending upon the installation. Unlike the IBM SP machine, the nodes are all contained within a single cabinet and backplane. They can, however, all be controlled from a single console system.

    I have not worked with these machines yet. However, in my estimation, this is something like the old IBM VM machines except that the OS domains run on separate processors as assigned.

    FYI: http://www.sun.com/servers/enterprise/10000/

    Any ideas anyone?


  23. Tim Mattson <timothy.g.mattson@intel.com> proposes another definition.

    We can go in circles forever trying to define a cluster. I'm hesitant to even stretch out this discussion by proposing yet another definition. I think it's really important, though, that we come up with a definition we can all live with.

    I think the key point about a cluster is that a "reasonable end user" can put one together themselves. If it requires special hardware tightly integrated with the nodes (such as with the Sun Starfire/Ultra 10000 or the IBM SP machines), then it's not a cluster. If the user can grab N boxes from their vendor of choice and connect them together with an off-the-shelf switch, then it's a cluster.

    So I would use as my definition:

    A cluster is a type of parallel or distributed system that (1) consists of a collection of interconnected computers used as a single, unified computing resource, (2) can be constructed by a knowledgeable end user from "whole computers" (i.e., computers that can be used as stand-alone systems), and (3) uses off-the-shelf network interconnection technology.

    I think this definition includes the well known clusters and excludes the MPP-like systems with dedicated, specialized hardware.

    So, are we getting anywhere with this, or are we doomed forever to spin our wheels trying to agree on what a cluster is?


  24. Greg Pfister <pfister@us.ibm.com> responds to Rawn Shah (See message 22).

    I'm quite familiar with those beasts. What you're describing is what's more widely referred to as "partitioning": Separating one physical system into multiple, smaller ones, each of which runs independently: has its own OS, applications, etc.

    Sequent NUMA-Q systems also do this, as do some Intel-based Unisys systems, and mainframes have done it (in addition to Virtual Machines) for decades. **Extremely** popular feature. There are essentially no mainframes that aren't run partitioned. The Intel-based systems have a tendency to do it because NT doesn't support more than 4-way SMP, so what else are you going to do with a 64-processor system (Unisys, Sequent)? Yeah, there are other OSs... but they're not NT.

    Anyway: I don't think it needs special consideration, because when you partition one system into many, then connect them, what you get is -- a cluster. It happens to be all housed in one cabinet, but that's OK.

    What's not so obvious (I've had to try to do it) is convincing customers that you really do have a cluster -- with all the system management and other issues that entails.


  25. Greg Pfister <pfister@us.ibm.com> responds to Tim Mattson's definition (See message 23).

    >A cluster is a type of parallel or distributed system that
    >(1) consists of a collection of interconnected computers used as a single, unified computing resource
    >(2) can be constructed by a knowledgeable end user from "whole computers" (i.e. computers that can be used as stand alone systems) and
    >(3) uses off-the-shelf network interconnection technology.

    We'll hear from others, I'm sure, but I'm afraid I personally don't agree with this one, Tim, for several reasons.

    - It excludes too much of the industry. The arguably most functional clusters now existing are all proprietary: Compaq/DEC VMScluster, IBM Parallel Sysplex, Tandem Himalaya. At least those get my vote as currently having the broadest and deepest collections of cluster functions.

    - It eliminates publishing research in areas like inter-node communication hardware, since if it's research, it surely isn't off-the-shelf.

    - Why require that only the hardware be off-the-shelf? Wouldn't the same logic say you should do the same with software? That would leave us with rather little to do.

    - What you're describing is just too difficult for most customers to do in the first place. With the current state of the art, making an HA cluster (or one for scaling) work consistently, with anything like commercial utility, is *hard*. Evidence: Even in the pure Intel-based system space, users can't mix & match off-the-shelf parts and run Microsoft Cluster Services on it, because Microsoft only certifies whole cluster systems delivered intact from vendors. They were going to certify piece parts, but have given up because of the grotesque combinatorics of the testing problem.

    - Actually, given the above situation with Microsoft, you've eliminated most Intel-based clusters, too!

    Oh, yes, and:

    - It doesn't address my "small scale" issue; the Linux folks had a get-together last December in Europe, brought their PCs, and built a 500-some-odd-way system on which they merrily ran a parallel version of POV-Ray.


  26. TFCC Co-chair Rajkumar Buyya <rajkumar@dgs.monash.edu.au> responds to Mark Baker's message (See message 14) and tries to summarize the discussion so far.

    We have put a definition of Cluster (in our chapter -- same as Greg's) in the book "High Performance Cluster Computing" <http://www.dgs.monash.edu.au/~rajkumar/hpcc.html> as:

    A cluster is a type of parallel or distributed processing system which:
    * consists of a collection of networked computers, and
    * is used as a single, integrated computing resource.

    This or Greg's definition tries to include many clusters and similar systems.

    Tim of Intel has added two more points to Greg's and the above definition. I guess we need not include them as part of the definition, because one point was on the knowledge of the cluster assembler and the other on off-the-shelf networks. These days people are using both private and off-the-shelf networks for the interconnection network, but they still claim their systems are cluster-based.

    Our chapter on: "Cluster Computing at a Glance," can be found at: <http://www.dgs.monash.edu.au/~rajkumar/cluster/v1chap1.ps.gz> Please download and make your comments on the chapter.

    >I guess from my point of view I am not totally keen on the term "whole" - but the alternatives are no better (platforms/systems)! Does anyone have a better short definition?

    We need not use the term "whole computer"; just "computer" can be sufficient. As we know, a definition cannot be standardized, as everyone has their own way of explaining things.


  27. Greg Pfister <pfister@us.ibm.com> responds to Mark Baker's message (See message 14).

    I agree with what you've said, with one fairly minor exception:

    > the topics that overlap are to an extent historical - if the TFCC had appeared 5/6 years ago maybe there would not have been overlaps.

    I don't think they're historical in that sense. If anything, the situation would have been worse from the point of view of gaining attention / publication / funding / whatever -- clusters hadn't been recognized as a viable form of life by anybody in the normal publication channels at that point. Fewer people had retired.


  28. Barry Wilkinson, Department of Computer Science, University of North Carolina at Charlotte <abw@uncc.edu> comments on use of clusters in universities.

    Cluster computing has opened the door for many smaller institutions to become involved in parallel programming and this is what makes clusters different.

    Traditionally, parallel processing conferences concentrate upon algorithmic research. TFCC can also foster the wider impact of clusters. For example, I am currently involved in trying to get funding for a minority institution to get started in this area.

    I also believe that there is more to cluster computing than conventional parallel programming research areas, such as numerical problems. For example, clusters should be ideal as fast web servers.

    Obviously, with my responsibility as Education coordinator, I think we should have a significant presence here, which would be different from many major parallel programming conferences.


  29. Tim Mattson <timothy.g.mattson@intel.com> responds to Rajkumar Buyya (See message 26).

    My concern with some of our definitions of cluster is that they are too broad. We need to ask ourselves, "What makes a cluster different from an MPP?" A legitimate answer is "nothing," though if that's the case, we have a hard time justifying why there is even a TFCC.

    For example, consider the definition you gave:

    A cluster is a type of parallel or distributed processing system which:
    * consists of a collection of networked computers, and
    * is used as a single, integrated computing resource.

    This includes just about every distributed-memory computer in existence. Do we really want to call a Cray T3E or a Paragon a cluster? I don't think so.

    I think Greg's definition comes much closer to a useful definition since it requires "whole computers". I'm not sure "whole computers" is the best phrase, but it does nicely narrow a cluster to parallel systems built from nodes that can operate as stand alone computers.

    I added my two additional points in a feeble attempt to further narrow the definition. A "whole computer" definition still includes systems like the IBM SP. This machine has legitimately been called a cluster --- and maybe that's the way we as a group would like to see it classified.

    I would prefer, however, that the IBM SP not be considered a cluster by the TFCC definition. It is built, supported, and used like a traditional parallel supercomputer.

    As we think of a definition for cluster and a "mission statement" for the TFCC, we should keep clear in our minds the reasons for the popularity of clusters. I think it's because anyone can -- at least in principle -- put a cluster together. Whole computers, a separate network (whether private or off-the-shelf), plus some software to glue the whole thing together, and you have your cluster.

    The "roll your own" feature gives us a way to distinguish ourselves from the other parallel computing groups. We work with clusters so we need to worry about the glue that makes a loose ensemble of computers act like a single system. Its this focus on the glue that makes cluster people different from MPP people.

    I fear, though, that like the workshop discussions a few weeks ago, we are spinning our wheels and are on the verge of irritating the broader list with this discussion. I would be happy to accept the IBM SP as a cluster and stick with Greg's definition.


  30. Greg Lindahl <lindahl@cs.virginia.edu> responds to Tim Mattson (See message 29).

    >I would prefer, however, that the IBM SP not be considered a cluster by the TFCC definition. It is built, supported, and used like a traditional parallel supercomputer.

    My company is in the business of building machines which are built, supported, and used like an SP. But everyone would think my machines are clusters -- Alpha desktops, Linux, Myrinet, racks. You can't put that genie back in the bottle -- the SP is a cluster.

    >I would be happy to accept the IBM SP as a cluster and stick with Greg's definition.

    I think pretty much everyone has said the same thing.


  31. Tim Mattson <timothy.g.mattson@intel.com> responds to Greg Lindahl (See message 30).

    > I think pretty much everyone has said the same thing.

    Not quite. The definition Raj gave in his message dropped the use of the "whole computer" concept. I think the "whole computer" or "stand-alone computer" concept is vital to any definition of a cluster. I could live with a definition that let an SP be a cluster (though I think it's a mistake), but Raj's definition included the T3E and the Paragon. We should not define these MPP systems as clusters.


  32. Niraj Srivastava <niraj@clam.com> responds to Tim Mattson (See message 29).

    I think "the role your own" concept is the key to differentiating parallel or distributed processing system from clusters. As I attemped to explain in my previous email. It is the advancement in technology that makes cluster possible and viable as a "metacomputer" but I would prefer the term commodity components to "roll your own." What differentiates clusters from CM5, SP2, E1000, T3E is the promise to provide same functionality at a much better price/performance.

    Let me retry the definition:

    "A cluster is a object made of commodity components that works as a single system providing capability, availability and scalability computing platform. The components consist of nodes, OS and interconnects." At this point we can define nodes, OS and interconnect characteristics and requirements.

    I would stay away from "whole computer" because it would be nice to allow mass storage devices with a direct connection to the interconnect and define a parallel file system, or other kinds of specialized nodes, such as for visualization -- thus allowing me to do parallel processing that involves some nodes crunching and piping I/O for visualization to other specialty nodes.


  33. TFCC Co-chair Rajkumar Buyya <rajkumar@dgs.monash.edu.au> makes another attempt to summarize the cluster definition.

    How does the following sound (improved based on the comments of others; it has bits and pieces from all of us, I think):

    A cluster is a type of parallel or distributed processing system which:

    * consists of a collection of networked stand-alone computers, and
    * works together as a single system providing/offering capability, availability, and scalability (**services**).

    This way by default we can understand:

    single system --> single system image
    capability --> "high performance"
    availability --> "high availability/fault tolerance"
    scalability --> can easily be grown as hardware/software configurations change
    (**services**) can be dropped.

    We can reorganize the above definition.

    Thus we can avoid confusion with the "whole computer" term and still express clusters' major properties.


  34. Niraj Srivastava <niraj@clam.com> responds to Rajkumar Buyya's latest definition. (See message 33).

    Raj, I think you have captured the essence of the discussion of the last few days.

    One clarification: a stand-alone computer is a generic device that contributes to providing the services described.

    I can live with this definition.


  35. Greg Pfister <pfister@us.ibm.com> responds to Rajkumar Buyya's latest definition. (See message 33).

    Raj, I'm going to play "deconstruction" games again, and then ((semi-) re-) state what I think is a converged definition. This got too dang long. Sorry.

    First: I think we've already discussed the problems with making what clusters are *for* part of the definition. They are used for many things; the list is open-ended. For example, I sent out a list of 16 things clusters are used "for." Some of them aren't covered by your list. I make no claim that my list is comprehensive.

    Also, you said "networked" rather than "connected." I prefer the latter, since "networked" could be interpreted as "being part of an IP network," the way PCs connected to a server are referred to as "networked."

    You also said "working together as a single system." That's not bad, but "used as a single resource" is, I think, slightly and appropriately more general. This of course depends on what you mean by "system" and "resource." Do you call a batch farm a "batch system" or a "computing resource"? Is an HA file server a "shared file system" or a resource? Both words are overused, I think. My own tendency is to shy away from any term that slides into "single system image," which I believe has been misused terribly; that's really why I prefer "resource."

    Second: It's been sounding to me like many people in the debate are converging around a modified version of my definition, with additions.

    A cluster is a type of parallel or distributed system consisting of connected whole computers that are used as a single resource.

    The term "whole computer" is an intentionally vague reference to a computer system that is able to function independently.

    Clusters are used to obtain any of a large number of possible benefits, not all of which require scaling the cluster up to large numbers of its constituent whole computers. Some of the possible benefits include increased capacity, faster turnaround time, higher throughput, higher availability, separation of incompatible workloads, and lower cost of administration. This list is illustrative only; a cluster may exhibit any, all, or none of those particular benefits.

    Greg Lindahl please note: I snuck in a comment about scaling to be sure we avoided fixation on MASSIVE systems, but certainly didn't rule out big 'uns.

    Also note the addition of "intentionally vague reference." If you can't fix it, feature it.

    I have to say that I'm still bothered by a possible perceived lack of differentiation from distributed systems. Possibly we need some added words that explicitly attempt to make that distinction. I still wouldn't make them part of the definition -- possibly added as another paragraph? How about this:

    Clusters are distinguished among the types of parallel systems by their use of whole computers as constituent parts. They are distinguished from general distributed systems by degree of coupling; compared with general distributed systems, clusters' constituent computers generally have tighter physical coupling (such as fast communication links) and/or functional coupling (such as common administration).

    Comments?


  36. Tim Mattson <timothy.g.mattson@intel.com> responds to Greg Pfister (See message 35).

    Greg, in your last contribution to the great cluster-definition debate, you said:

    Clusters are distinguished among the types of parallel systems by their use of whole computers as constituent parts. They are distinguished from general distributed systems by degree of coupling; compared with general distributed systems, clusters' constituent computers generally have tighter physical coupling (such as fast communication links) and/or functional coupling (such as common administration).

    I disagree with your view that clusters must include "a tighter physical coupling" or a "functional coupling". I see no reason why a collection of desktop PCs connected by an Ethernet network should not be viewed as a cluster. If you can use MPI or some other software layer to make the ensemble of computers work together on the same problem, then it's a cluster.
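
    [Editor's note: as a minimal, illustrative sketch of the kind of software layer Tim mentions, the following C program uses plain MPI to make a pile of networked PCs cooperate on one (trivial) job. It assumes nothing beyond a standard MPI installation (mpicc and mpirun); the host file and machine names in the usage comment are hypothetical.]

        /* hello_cluster.c -- each whole computer reports in; MPI is the only
           "glue" tying the otherwise independent machines together. */
        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, size, name_len;
            char name[MPI_MAX_PROCESSOR_NAME];

            MPI_Init(&argc, &argv);                   /* join the ensemble      */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* my place in it         */
            MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes     */
            MPI_Get_processor_name(name, &name_len);  /* which physical machine */

            printf("process %d of %d running on %s\n", rank, size, name);

            MPI_Finalize();
            return 0;
        }

    [Compiled with "mpicc hello_cluster.c -o hello_cluster" and launched with something like "mpirun -np 4 -machinefile hosts ./hello_cluster", where the hosts file lists the (hypothetical) PC names, the same binary runs on every machine and the ensemble briefly behaves as one computing resource.]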


  37. Greg Lindahl <lindahl@cs.virginia.edu> responds to Greg Pfister (See message 35).

    >Greg Lindahl please note: I snuck in a comment about scaling to be sure we avoided fixation on MASSIVE systems, but certainly didn't rule out big 'uns.

    That's an important point. A classic high availability cluster is 2-3 nodes, and we want to include all 3 types of clusters in our definition.

    > Clusters are distinguished among the types of parallel systems by their use of whole computers as constituent parts. They are distinguished from general distributed systems by degree of coupling; compared with general distributed systems, clusters' constituent computers generally have tighter physical coupling (such as fast communication links) and/or functional coupling (such as common administration).

    That's about as good a try as I can think of. It's important for us to say what we think the differences are, even if we can't do a great job of it.


  38. Greg Pfister <pfister@us.ibm.com> responds to Tim Mattson (See message 36).

    >I disagree with your view that clusters must include "a tighter physical coupling" or a "functional coupling".
    >I see no reason why a collection of desktop PCs connected by an Ethernet network should not be viewed as a cluster.
    >If you can use MPI or some other software layer to make the ensemble of computers work together on the same problem, then it's a cluster.

    Then I have to ask how you would distinguish clusters from distributed systems.

    I don't think PCs connected by a LAN are a cluster just because some day somebody might run MPI or PVM or whatever on them. Wouldn't that make every distributed system a cluster? If cluster = distributed system, we should go join the TC on distributed systems and stop this TF.

    At the same time, of course there are LAN-connected clusters whose nodes are PCs. It's the way they're used that makes a difference. Take some PCs on a LAN *and* put the appropriate software on them; then while that software is running those PCs are a cluster. The software constitutes the "functional coupling."

    And yes, that means the same collection of hardware can be a cluster some times, and not a cluster at other times. That might sound strange, but it follows from the "or" in the description I used -- physical and/_or_ functional coupling. The coupling doesn't have to be both.

    *****************************

    What follows is a rant that I initially composed in answer to your note. Then I re-read what you wrote, and decided you really were asking a different question. So I answered what I now think is the real question above. But I put some effort into the rant, so what the hey, rather than pushing the "delete" key I might as well point it towards the archives. Ignore, read, or comment as you wish.

    *****************************

    The issue to me isn't having a parallel program run using MPI or PVM or VIA or whatever.

    The issue, rather, is how you deal with the whole bunch of other stuff you have to do to make the ensemble work correctly, for example: making sure the right daemons are running on each system, having the files available that the parallel application uses, having user IDs of some kind available on the systems under which the parallel application runs (if the systems support the notion of IDs), making sure the PCs are powered-on, etc. In other words, all the gorp that doesn't get discussed in a parallel programming class, but is necessary. (And in practice, costs a whole lot of money in administrative personnel costs when done by real customers.)

    It's possible to get that "whole bunch of stuff" working on a totally ad hoc basis. It's also possible to establish a common framework to make it easier -- software, but also possibly physical, even as simple as putting them all in a rack with a common power-on switch.

    If there is some kind of common support framework, then I'd say it's a cluster. If not, it's not; it's a distributed system, but not a cluster.
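    [Editor's note: as one concrete illustration of the "common support framework" Greg describes, here is a hypothetical sketch -- not from the discussion -- that checks each node for reachability, a required daemon, and a shared file from a single administrative point. The node names, daemon name, and file path are invented for the example.]

        # A hypothetical node-readiness check run from one administrative point.
        import subprocess

        NODES = ["node1", "node2", "node3", "node4"]   # hypothetical host names
        DAEMON = "rstatd"                              # hypothetical required daemon
        SHARED_FILE = "/shared/app/input.dat"          # hypothetical shared path

        def node_ok(host):
            """Return True if the node answers ssh, runs the daemon, and sees the file."""
            checks = [
                "true",                      # reachable at all?
                f"pgrep -x {DAEMON}",        # is the daemon running?
                f"test -r {SHARED_FILE}",    # is the shared file visible and readable?
            ]
            for cmd in checks:
                result = subprocess.run(["ssh", host, cmd],
                                        capture_output=True, timeout=10)
                if result.returncode != 0:
                    return False
            return True

        if __name__ == "__main__":
            for host in NODES:
                print(host, "ok" if node_ok(host) else "NOT ready")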

    There are twitchy variations. If the PCs are all in one room, with one person hovering over them and doing it all manually by short-range sneakernet, without even the traditional collection of shell scripts from Hell, well, gah. It's on the cusp. I'd say he (or she) is doing enough work that it's called whatever she or he likes.

    Here's another example from a different viewpoint: two separate systems, running totally different programs that never communicate with each other, but linked so that failover for HA works between the systems. That's almost always a cluster by my definition. Why? Not because you're running a dinky little heartbeat exchange program in parallel across them; that's necessary, but far from sufficient. I'd point instead to the tremendous amount of administrative synchronization needed between those two systems for failover to work. Consider: all the flags, options, licenses, disk access paths, etc., must be set up so that the failed-over program still works the same way when it fails over. *Every* commercial HA cluster product provides some help for doing that, and achieving it definitely results in tight administrative coupling. (For example, Microsoft Wolfpack synchronizes registry changes among the machines, using two-phase commit.)
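    [Editor's note: the following is an in-memory illustration of the two-phase commit idea mentioned above -- ours, not Wolfpack's code. A configuration change is applied on every node or on none, so a failed-over program always finds consistent settings. The names and the sample setting are invented.]

        # Two-phase commit, sketched for propagating a configuration change.
        class Node:
            def __init__(self, name):
                self.name = name
                self.config = {}       # the "registry" of this node
                self.pending = None    # change prepared but not yet committed

            def prepare(self, change):
                """Phase 1: promise to apply the change (a real node would log it durably)."""
                self.pending = change
                return True            # vote "yes"; a failure here would vote "no"

            def commit(self):
                """Phase 2a: make the prepared change permanent."""
                self.config.update(self.pending)
                self.pending = None

            def abort(self):
                """Phase 2b: throw the prepared change away."""
                self.pending = None

        def replicate(change, nodes):
            """Coordinator: apply the change everywhere, or nowhere."""
            if all(node.prepare(change) for node in nodes):
                for node in nodes:
                    node.commit()
                return True
            for node in nodes:
                node.abort()
            return False

        pair = [Node("alpha"), Node("beta")]                      # hypothetical HA pair
        replicate({"license_server": "lic1.example.com"}, pair)   # hypothetical setting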

    Maybe there are similar support programs that always get used with PVM and MPI. I don't know; I've not personally used them. If that support is present, then I'd say every MPI/PVM/whatnot-coupled collection of whatever, whether physically coupled by HyperOptiPetaNet or trained gerbils carrying little memos in their teeth, is a cluster.

    Possibly there are words that would make this point clearer. I was thinking of this kind of thing as being an implication of the phrase "common administration," although it isn't, I'll admit, what necessarily springs to mind given that phrase.


  39. Niraj Srivastava <niraj@clam.com> responds to Greg Pfister (See message 38).

    I agree that a bunch of PCs connected by Ethernet is NOT a cluster, but PCs with Microsoft Cluster Services (MSCS) or LSF are a cluster. So we are back to a collection of interconnected computing devices with IPC, I/O, and cluster management capabilities.


  40. Greg Lindahl <lindahl@cs.virginia.edu> responds to Niraj Srivastava (See message 39).

    I suspect that most people who have a bunch of PCs connected by ethernet and call it a cluster have other software involved. If you have common administration and a copy of mpich, doesn't that make it a (single-user) cluster?


  41. TFCC Co-chair Rajkumar Buyya <rajkumar@dgs.monash.edu.au> again tries to summarize the discussion.

    I am trying to reorganize the cluster definition based on your comments. Hopefully it looks better than the earlier version. Here it is:

    A cluster is a type of parallel or distributed processing system which:
    * consists of a collection of (inter)connected stand-alone computers, and
    * working together as a single(/unified) resource providing (**/offering/supporting**) capability, availability, and(/or) scalability (**services/functionalities**).

    We need to choose suitable words enclosed within brackets or discard them.

    >I think we've already discussed the problems with making what clusters are *for* part of the definition. They are used for many things; the list is open-ended.
    >For example, I sent out a list of 16 things clusters are used "for." Some of them aren't covered by your list. I make no claim that my list is comprehensive. [from Greg Pfister - see message 33.]

    Yes, we started discussion with that. I think the many reasons for using clusters can fall into one of the following:

    Transparency (or SSI)
    capability
    availability
    scalability

    We covered all of them in the definition.

    >Also, you said "networked" rather than "connected." I prefer the latter, since "networked" could be interpreted as "being part of an IP network," the way PCs connected to a server are referred to as "networked." [from Greg Pfister - see message 33.]

    Yes, let us use "(inter)connected".

    >You also said "working together as a single system." That's not bad, but "used as a single resource" is, I think, slightly and appropriately more general.
    >This of course depends on what you mean by "system" and "resource." Do you call a batch farm a "batch system" or a "computing resource"?
    >Is an HA file server a "shared file system" or a resource? Both words are overused, I think.
    >My own tendency is to shy away from any term that slides into "single system image," which I believe has been misused terribly; that's really why I prefer "resource."

    No problem. Let us use "single/unified resource".

    Others have already commented on your "Second" thought. Basically, the "CORE" part of the definition is yours (Greg Pfister), and then we added other words by picking up thoughts/comments from active members of the list. Those who have not yet voiced an opinion on the cluster definition, please do.


  42. Niraj Srivastava <niraj@clam.com> responds to Greg Lindahl (See message 40).

    Yes, there is some glue to provide common administration. TFCC should be in a position to define the minimum set of services that this glue provides to transform a collection of interconnected computers into a Cluster Computing Environment.


  43. Greg Lindahl <lindahl@cs.virginia.edu> responds to Niraj Srivastava (See message 42).

    That sounds dangerously close to attempting to define a "single system image". I suspect that whatever minimum set we define will exceed the state of practice of many things called "clusters" today.

    For example, you might include "similar user accounts on all machines" or a "shared filesystem" on your list. These are desirable in most instances. But there is software which runs applications out of a "generic user" account, and there are clusters which always run applications which don't need a shared filesystem. In fact, most MPI programs do I/O in only 1 process; most low-end scientific clusters have a shared filesystem only to get the user program distributed among the nodes. So I would be hard pressed to define that minimum set of services as anything other than "whatever services are necessary to use the cluster as a cluster."
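    [Editor's note: a sketch of the "I/O in only 1 process" pattern Greg Lindahl describes, again assuming the mpi4py binding; the file names are hypothetical. Only rank 0 touches the filesystem, so no shared filesystem is needed at run time.]

        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        # Rank 0 reads the input; every other rank receives the data over MPI.
        if rank == 0:
            with open("input.dat") as f:            # hypothetical input file
                data = [float(line) for line in f]
        else:
            data = None
        data = comm.bcast(data, root=0)

        local = sum(data) / comm.Get_size()         # stand-in for real per-node work
        results = comm.gather(local, root=0)

        # Rank 0 writes the output; the other ranks never open a file.
        if rank == 0:
            with open("output.dat", "w") as f:      # hypothetical output file
                f.write(repr(results) + "\n")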


  44. Greg Pfister <pfister@us.ibm.com> responds to Rajkumar Buyya's latest definition of cluster (See message 41).

    >I think many reasons for using clusters can fall into one of the following:
    >Transparency (or SSI)
    >capability
    >availability
    >scalability
    >We covered all of them in the definition.

    Two comments:

    First, and primarily: as you say, *many* -- but not *all* -- of the reasons are covered. If the definition says clusters provide those things, it tells me that something which doesn't provide them is not a cluster. What else can it mean? If a cluster is something "providing capability [whatever that is], availability, ...", then something that doesn't provide those things is NOT a cluster.

    Second, "many" is in the eye of the beholder. According to a survey done last year by IDC, the primary reason for the majority of cluster installations in the industry is not even listed: it's system administration. Sweeping a bunch of campus-distributed servers into one pile can, again from the survey, produce a *seven*-fold decrease in administration cost.

    Between those two elements, the definition you've provided rules out the majority of clusters actually in use in the industry.

    Obviously, I brought that one up because it's not on your list.

    There's also workload isolation. And probably others.

    What is so distasteful about taking the uses of these things out of the definition?


  45. Giovanni Chiola, University of Genoa <chiola@disi.unige.it> responds to Rajkumar Buyya's latest definition of cluster (See message 41).

    >A cluster is a type of parallel or distributed processing system which:
    >* consists of a collection of (inter)connected stand-alone computers, and
    >* working together as a single(/unified) resource providing (**/offering/supporting**) capability, availability, and(/or) scalability (**services/functionalities**).

    Raj, I have hesitated so far to join the debate because I think that coming to a good definition is very difficult. Maybe we should keep improving our definition as a background activity of the TFCC.

    I have no strong feelings about your first "*". I don't see any real difference between "stand-alone" and "whole", or between "networked" and "interconnected". I don't think the point is whether the network should be IP-based or not. We should avoid a definition based on current technology. We should also avoid differentiating between Clusters, Parallel Systems, and Distributed Systems based only on technological considerations (which will inevitably become obsolete very soon).

    Concerning the second "*", I don't like the "single/unified resource" idea. Clusters can provide different resources that can be of value to users, such as RAM space, disk space, CPU power, communication bandwidth, etc. I'd rather formulate the idea in terms of a "unified set of resources".

    Two things that have been suggested by Greg [Pfister] and Tim [Mattson] are still missing from this definition. First, the explicit comparison with PP and DS. Second, the emphasis on "off-the-shelf components".

    Let me start with the "off-the-shelf" issue. I totally agree with Tim [Mattson] that it is an essential characteristic of current Clusters.

    If we disagree on using this term, then I would suggest looking at the issue from a different point of view. Today the use of "off-the-shelf" components is almost mandatory if you want to achieve cost-effectiveness. The adoption of off-the-shelf components (both hardware and software, I agree on this remark) allows you to put together cheap components that offer very good performance.

    I would thus suggest adding the "cost-effective" concept as one of the major characteristics of Clusters. Today this implies the adoption of off-the-shelf components in 90% of cases, while it leaves open the option of considering as genuine "Clusters" things that include proprietary, special-purpose components, provided they are as cost-effective as the commodity ones.

    The focus on cost-effectiveness could also help us find our way to distinguish the TFCC from the TCPP and the TCDS, even if there is an overlap in technology, applications, etc. Parallel systems aim at maximum performance, no matter the cost of the platform. Distributed systems aim at distribution as a value in itself, almost independently of cost and performance considerations. We focus on distributed systems that allow efficient parallel processing at low cost.

    Notice that for me cost is not only the cost of buying the hardware/software components. This cost could even drop to zero if you assemble your cluster out of an existing set of PCs and LANs in your Department. I also include in my definition of cost the time you spend selecting the components, putting them together, and running and maintaining the cluster. So the single-image feature, administration tools, etc., are all parts of a Cluster that contribute to making it more and more cost-effective from the user's point of view, compared to what we would not like to call a Cluster.

    For me the T3E and the SP are not Clusters because they are not as flexible and cost-effective as the "real clusters" currently are.


  46. Greg Lindahl <lindahl@cs.virginia.edu> responds to Giovanni Chiola (See message 45).

    >Let me start with the "off-the-shelf" issue. I totally agree with Tim that it is an essential characteristic of current Clusters.

    I use the term "commodity cluster" when I want to talk about clusters built with off-the-shelf parts. I claim that commodity clusters are more cost-effective than other clusters, but this is a debatable point. I don't think we should enshrine the answer to the debate in our definition of all clusters.

    In addition, it is *likely* that a high-availability cluster (as the term is used today) has non-off-the-shelf parts. Those of us coming into the cluster community from the scientific computing community should keep in mind that there's more than one type of cluster.

    >For me the T3E and the SP are not Clusters because they are not as flexible and cost-effective as the "real clusters" currently are.

    My company builds clusters which are extremely similar to the IBM SP, out of commodity parts. The user gets the same programming model. The hardware architecture is the same. The system administration tools have similar functionality. The system software has similar functionality. Why is one system not a cluster, when the other is? I don't think "because I think one is too expensive" is a good answer. The IBM SP looks like a duck, and it quacks like a duck. It's a duck!


  47. Giovanni Chiola <chiola@disi.unige.it> responds to Greg Lindahl (See message 46).

    >Why is one system not a cluster, when the other is? I don't think "because I think one is too expensive" is a good answer.

    Instead I think this is the only reasonable answer.


  48. Greg Lindahl <lindahl@cs.virginia.edu> responds to Giovanni Chiola (See message 47).

    >Instead I think this is the only reasonable answer.

    What's the problem with calling off-the-shelf, inexpensive clusters "commodity clusters"?


  49. Giovanni Chiola <chiola@disi.unige.it> responds to Greg Lindahl (See message 48).

    This is why, after thinking carefully about the topic, I am now suggesting that we stress "cost-effectiveness" as one of the main cluster characteristics rather than "commodity components". Indeed, using commodity components is one way (today the main one) to achieve cost-effectiveness. Cost-effectiveness is an aspect that is not crucial for either the TCPP or the TCDS, while it is crucial in most clusters, starting with Beowulf (which in my opinion is the prime example of a cluster). For particular purposes even the use of non-commodity components could be cost-effective compared to other solutions, and hence interesting for the TFCC.


  50. Greg Pfister <pfister@us.ibm.com> responds to Giovanni Chiola (See message 49).

    If the only clusters allowed by this task force are those buildable by academics on a shoestring budget, I'm out of here. And I would wager that most of your industry-side members will be, too.


  51. Marcin Paprzycki, University of Southern Mississippi <marcin@orca.st.usm.edu> responds to Greg Pfister (See message 50).

    After attending the SIAM Parallel meeting in San Antonio, I returned and started to catch up on past e-mails. Reading the definition-oriented messages was fascinating. It seems that we are really going somewhat in circles, with a few players and many spectators. And even though the definition seems to elude us, I believe that this very discussion is rather valuable. But then I came across this message:

    On Fri, 26 Mar 1999 pfister@us.ibm.com wrote:
    >If the only clusters allowed by this task force are those buildable by academics on a shoestring budget, I'm out of here.
    >And I would wager that most of your industry-side members will be, too.

    and I started to have a very bad feeling. There is a touch of a threat in its tone, and the appearance of a schism between industry (with money) and academia (without money). Any hints of this are very bad for the TFCC!

    I would like to suggest that we all step back, take a deep breath, read the messages that have appeared on the list thus far, and do some thinking for a couple of days (a cooling-down period). Then we should come back and start looking for new ways of attacking the problem.

    I would also like to suggest that Mark and Rajkumar try to summarize what has been said thus far -- the most important points/issues, pros and cons -- and present us with such a summary. This may also give us some food for thought.


  52. Giovanni Chiola <chiola@disi.unige.it> responds to Marcin Paprzycki (See message 51).

    I agree. This is not the way one can contribute to a free, constructive discussion.

    Nobody said "if you include the SP in the cluster definition I'm out of here" (even if I think that it is not reasonable to include the SP in the cluster definition, in the same way it is not reasonable to include the T3E, the CM5, the CS, etc.). I think we should leave the MPPs to the TCPP, and claim only the "real clusters" for the TFCC, otherwise it would be really difficult for us to establish the new task force. Anyway, if the majority of the TFCC members is interested in substituting the TCPP I think this is no good reason to consider myself "out of here."

    So, if we want to discuss an effective strategy to differentiate Clusters (and the TFCC) from parallel platforms (and the TCPP), I think we should consider a distinctive characteristic of most clusters as opposed to MPPs. And I think it is not acceptable to answer people who propose such a classification in that way.

    The first concern is whether the TFCC should exist or not, and whether it should become a TCCC in the near future or not. The fact that somebody is "in" or "out" at the moment is of marginal concern, in my opinion. So, if somebody has good, convincing arguments to support the claim that cost-effectiveness is not a major issue in Clusters, please disseminate those arguments.

    "I'm not interested in cost-effectiveness because I've got enough budget" is not a good argument, in my opinion. You may have a large budget to run your own lab, but your customers may not have enough budget to buy your "expensive clusters", so I think cost-effectiveness is of primary concern to industry as well (though, of course, this is just an external opinion from a person who has no experience in real business).

    On the other hand, I don't think this divergence of opinion is related to affiliation with academia or industry. I would remind everyone that the remark on off-the-shelf components came from an Intel person (and Intel, in my opinion, is no less industry than IBM).


  53. TFCC Co-chair Rajkumar Buyya <rajkumar@dgs.monash.edu.au> responds to Giovanni Chiola (See message 52).

    Let us try to avoid comparing one company with another. Opinions expressed on this mailing list are personal. Just because a researcher from company ABC said something does not mean that it is the opinion of company ABC. Please do not think so.

    Kindly avoid statements like "XYZ is no less than ABC company". We have had a great discussion on cluster computing.

    We all know that "cluster computing is a wave of the future". Let us ensure that our Task Force contributes to that. Maybe we can also avoid statements like "one TC replacing another TC."

    Many clusters have been built (and will be built), both in industry and academia, at different costs:
    - some follow a 100% commodity path
    - some make zero investment (they use existing infrastructure)
    - some invest amount X and some amount Y
    - some use PCs/workstations/SMPs running:
      -- Linux
      -- Solaris
      -- NT, etc.
    - the interconnection network can be:
      -- Ethernet
      -- Fast Ethernet
      -- ATM
      -- Myrinet
      -- SCI
      -- ...
      -- private networks (for performance reasons -- performance at any cost)

    Customers choose the right kind of configuration based on the availability of funds and their requirements.

    Companies were even clustering mainframes in the 1960s. Cluster computing has gained momentum due to the feasibility of clustering desktop computers. We all know the reasons for this.

    We all want a "cost-effective" solution, and we want to use commodity components to leverage existing technology and to take advantage of rapid advances in commodity components.

    For some, clusters based on Myrinet are cost-effective; for others, clusters based on Ethernet are. We pay different prices to get them. It all depends on what we want out of the system, which we can build ourselves or purchase.

    So, let us try to come up with a conceptual "Cluster Definition".


  54. Dan Hyde of Bucknell University <hyde at bucknell.edu> wants to contrast "cluster" with "distributed systems."

    I have followed the "Define Cluster" debate with GREAT interest and have found the discussions very stimulating!

    I am fairly happy with Greg Pfister's original definition of a cluster.

    "A cluster is a type of parallel or distributed system that consists of a collection of interconnected whole computers used as a single, unified computing resource." Greg Pfister

    However, I think a lot of the efforts to "improve" on Greg's definition have not helped much. Still, it has been a healthy and worthwhile discussion.

    I want to take a different approach. I would like to contrast Greg's definition with a published definition of "distributed systems." Below is a definition from a book I am sure many of you will recognize.

    "We define a distributed system as a collection of autonomous computers linked by a network, with software designed to produce an integrated computing facility."

    from "Distributed Systems: Concepts and Design," second edition by George Coulouris, Jean Dollimore, and Tim Kindberg, Addison-Wesley, 1994, page 1. Notice that the definition is over 5 years old.

    At first glance, there are a lot of similarities. Since we certainly don't want clusters = distributed systems, we need to study the subtle differences.

    1. Our discussion of "whole computer" to distinguish cluster from other beasts seems appropriate.

    2. Also, clusters are interconnected and not necessarily networked.

    3. The phrase "used as a single, unified computing resource" has a different meaning from "with software designed to produce an integrated computing facility" but seems close.

    These three points help to distinguish clusters from distributed systems, but I suspect many customers, marketing types, etc. would have a hard time seeing any difference between these two definitions.

    Playing Devil's Advocate: Are clusters only a slight twist on distributed systems?? What do you think?


  55. Greg Lindahl <lindahl@cs.virginia.edu> responds to Dan Hyde (See message 54).

    >At first glance, there are a lot of similarities. Since we certainly don't want clusters = distributed systems, we need to study the subtle differences.

    Greg [Pfister] pointed out another difference which you neglected: distributed systems have less reliable and slower communication links than clusters.


  56. TFCC Co-chair Rajkumar Buyya <rajkumar@dgs.monash.edu.au> responds to Dan Hyde (See message 54).

    We can add another distinguishing point:

    4. Unlike in a distributed system, the nodes of a cluster have a "strong sense of membership" (membership here means belonging).

    With this point added, Greg's definition and our earlier reorganized definition can be stated as:

    A cluster is a type of parallel or distributed processing system which
    a) consists of a collection of interconnected stand-alone/whole computers
    b) working together **with a strong sense of membership** as a single, unified computing resource, and
    c) (supporting capability, availability, and(/or) scalability services), OR (supporting services such as capability, availability, and scalability).

    I think we all agree on point (a)!?

    The new point in the above definition (see (b)) is that cluster nodes have a "strong sense of membership". This point is very important: it makes cluster nodes work cooperatively, with a strong sense of belonging/membership. It needs to be positioned suitably as part of (b).

    As Greg Pfister indicated, there are several reasons for using clusters, and it is hard to say which one is most important. But we want clusters to exhibit certain key characteristics. So, instead of a closed definition, we can create a somewhat more open definition by adopting the SECOND form of point (c).

    What is your opinion ?


  57. Rajkumar Buyya <rajkumar@dgs.monash.edu.au> responds to Greg Pfister (See message 44).

    >Two comments:
    >First, and primarily: as you say, *many* -- but not *all* -- of the reasons are covered. If the definition says clusters provide those things, it tells me that something which doesn't provide them is not a cluster.
    >What else can it mean? If a cluster is something "providing capability [whatever that is], availability, ...", then something that doesn't provide those things is NOT a cluster.
    >
    >Second, "many" is in the eye of the beholder. According to a survey done last year by IDC, the primary reason for the majority of cluster installations in the industry is not even listed:
    >it's system administration. Sweeping a bunch of campus-distributed servers into one pile can, again from the survey, produce a *seven*-fold decrease in administration cost.

    I think ease of administration comes from SSI.

    >Between those two elements, the definition you've provided rules out the majority of clusters actually in use in the industry.
    >
    >Obviously, I brought that one up because it's not on your list.
    >
    >There's also workload isolation. And probably others.
    >
    >What is so distasteful about taking the uses of these things out of the definition?

    No problem. Many members of TFCC-L have indicated those three as the primary features clusters need to provide. Yes, it is very difficult to list such things as part of the definition.

    In my next e-mail, in response to Dan Hyde [see message 56], I will add one more point on why clusters are different from distributed systems.


  58. Orly Kremien <orly@macs.biu.ac.il> responds to Dan Hyde (See message 54).

    Dan, I believe clusters are only a slight twist on distributed systems. As I see it, they are composed of autonomous off-the-shelf computers (e.g., PCs) interconnected by a high-speed off-the-shelf interconnect (e.g., a Fast Ethernet switch), with off-the-shelf software designed to produce an integrated computing facility (e.g., PVM). I see it as making High-Performance Computing (HPC) easily available.


Includes the TFCC-L "definition of cluster" messages as of March 31, 1999.


Page maintained by Dan Hyde, hyde at bucknell.edu. Last update: March 31, 1999.