Deployment in a Real Cluster

Although running PlinyCompute on a single machine (e.g. a laptop) is ideal for becoming familiar with the system and testing some of its functionality, PlinyCompute’s high-performance features are best suited to processing large data loads in a real distributed cluster, whether on Amazon AWS, on another cloud provider, or on-premises. To deploy on such a cluster, follow these steps:

Log into a remote machine that will serve as the manager node (e.g. an Amazon AWS instance).

Once logged in, clone PlinyCompute from GitHub by issuing the following command:

$ git clone https://github.com/riceplinygroup/plinycompute.git

This command downloads PlinyCompute into a folder named plinycompute. Make sure you are in that directory before continuing. On a Linux machine the prompt should look similar to:

ubuntu@manager:~/plinycompute$

Set the following two environment variables: a) PDB_HOME, the path to the folder where PlinyCompute was cloned (in this example /home/ubuntu/plinycompute), and b) PDB_INSTALL, the path to a folder on the worker nodes (remote machines) where the PlinyCompute executables will be installed. Note: the values of these variables are arbitrary (and they do not have to match), but make sure that you have the permissions needed to create folders and write files on the remote machines. In this example, PlinyCompute is located in /home/ubuntu/plinycompute on the manager node and is installed in /tmp/pdb_install on the worker nodes.

export PDB_HOME=/home/ubuntu/plinycompute
export PDB_INSTALL=/tmp/pdb_install
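
Because the later steps depend on both variables, it can help to confirm they are set before going further. The following is a minimal sketch of such a check; the function name check_pdb_env is hypothetical and is not part of PlinyCompute.

```shell
# Hypothetical helper (not shipped with PlinyCompute): confirm that both
# variables are set and that PDB_HOME points to an existing directory.
check_pdb_env() {
    if [ -z "${PDB_HOME:-}" ]; then
        echo "PDB_HOME is not set" >&2
        return 1
    fi
    if [ -z "${PDB_INSTALL:-}" ]; then
        echo "PDB_INSTALL is not set" >&2
        return 1
    fi
    # PDB_INSTALL lives on the worker nodes, so only PDB_HOME is checked locally.
    if [ ! -d "$PDB_HOME" ]; then
        echo "PDB_HOME is not a directory: $PDB_HOME" >&2
        return 1
    fi
    echo "PDB_HOME=$PDB_HOME"
    echo "PDB_INSTALL=$PDB_INSTALL"
}
```

Calling check_pdb_env on the manager node prints both values on success and returns a non-zero status otherwise.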

Edit the conf/serverlist file so that it lists the IP addresses of the worker nodes (machines) in the cluster, one IP address per line. The content of the file should look similar to the following (replace the IPs with your own; the ones shown here are fictitious):

192.168.1.1
192.168.1.2
192.168.1.3

In the above example, the cluster includes one manager node (where PlinyCompute was cloned) and three worker nodes, whose IP addresses are listed in the conf/serverlist file.
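
A malformed entry in conf/serverlist is easy to miss, so a quick format check before installing may save time. The sketch below assumes the file contains nothing but one dotted-quad IPv4 address per line, as described above; the function name validate_serverlist is hypothetical, and the check looks only at the shape of each line (it does not verify octet ranges or reachability).

```shell
# Hypothetical helper (not shipped with PlinyCompute): report every line of a
# server list that does not look like a dotted-quad IPv4 address.
# Assumption: one address per line and nothing else; blank lines are flagged too.
validate_serverlist() {
    file="$1"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # -v keeps the lines that do NOT match; -n prefixes them with line numbers.
    bad=$(grep -Evn '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$file")
    if [ -n "$bad" ]; then
        echo "invalid entries in $file:" >&2
        echo "$bad" >&2
        return 1
    fi
    # grep -c '' counts the lines in the file.
    echo "$(grep -c '' "$file") worker node(s) listed"
}
```

It would be run from the plinycompute directory as validate_serverlist conf/serverlist.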

Invoke cmake with the following command:

$ cmake -DUSE_DEBUG:BOOL=OFF .

Build the following make target, replacing the value of the -j argument with the number of recipes (jobs) that make should execute in parallel:

$ make -j 4 pdb-main
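
A common choice for the -j value is the number of CPU cores on the build machine. The sketch below shows one way to obtain that number; the function name pdb_make_jobs is hypothetical, and it relies on nproc (GNU coreutils) with a fallback to getconf.

```shell
# Hypothetical helper: print the number of online CPU cores, or 1 if it
# cannot be determined, for use as the value of make's -j argument.
pdb_make_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

# Usage, from the plinycompute directory after cmake has finished:
#   make -j "$(pdb_make_jobs)" pdb-main
```

On machines with little memory, a value lower than the core count may be the safer choice.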

This will generate two executables in the folder $PDB_HOME/bin:

pdb-manager
pdb-worker

Run the following script, which connects to each of the worker nodes and installs PlinyCompute. Its argument is the pem file used to connect to the worker nodes (conf/pdb-key.pem in this example).

$ $PDB_HOME/scripts/install.sh conf/pdb-key.pem

This script generates output similar to the one below for all nodes in the cluster (only part of the output is shown here for clarity):

.
.
.
---------------------------------
Results of script install.sh:
*** Failed results (0/3) ***

*** Successful results (3/3) ***
Worker node with IP: 192.168.1.1 successfully installed.
Worker node with IP: 192.168.1.2 successfully installed.
Worker node with IP: 192.168.1.3 successfully installed.
---------------------------------

Launch the PlinyCompute cluster with this script:

$ $PDB_HOME/scripts/startCluster.sh <cluster_type> <manager_node_ip> <pem_file> [num_threads] [shared_memory]

The first three arguments are required:
cluster_type: must be distributed
manager_node_ip: the public IP address of the manager node
pem_file: the pem file used to connect to the machines in the cluster

The last two arguments are optional:
num_threads: the number of CPU cores on each worker node that PlinyCompute will use
shared_memory: the amount of RAM on each worker node (in megabytes) that PlinyCompute will use

In the following example, the public IP address of the manager node is 192.168.1.0 and the pem file is conf/pdb-key.pem. Because the optional arguments are omitted, the cluster is launched with the defaults: 1 thread and 2 GB of memory.

$ $PDB_HOME/scripts/startCluster.sh distributed 192.168.1.0 conf/pdb-key.pem
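
To override the defaults, the two optional arguments are appended in the order given in the usage line above. The sketch below is a dry run that only assembles and prints the command line; the function name pdb_start_cmd and the values 4 (threads) and 4096 (megabytes) are hypothetical and chosen purely for illustration.

```shell
# Hypothetical helper: build (but do not run) a startCluster.sh command line.
# Arguments: manager_node_ip pem_file [num_threads] [shared_memory]
pdb_start_cmd() {
    manager_ip="$1"
    pem_file="$2"
    threads="${3:-}"
    mem_mb="${4:-}"
    # \$PDB_HOME is kept literal so the printed command can be pasted as-is.
    cmd="\$PDB_HOME/scripts/startCluster.sh distributed $manager_ip $pem_file"
    # Unquoted on purpose: omitted optional arguments expand to nothing.
    for n in $threads $mem_mb; do
        case "$n" in
            *[!0-9]*) echo "not an integer: $n" >&2; return 1 ;;
        esac
        cmd="$cmd $n"
    done
    echo "$cmd"
}

# Print the command for 4 threads and 4096 MB of memory per worker node.
pdb_start_cmd 192.168.1.0 conf/pdb-key.pem 4 4096
```

Since the arguments are positional, shared_memory can only be supplied together with num_threads.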

After completion, the output should look similar to the one below (only part of the output is shown here for clarity):

.
.
.
---------------------------------
Results of script startCluster.sh:
*** Failed results (0/3) ***

*** Successful results (3/3) ***
Worker node with IP: 192.168.1.1 successfully started.
Worker node with IP: 192.168.1.2 successfully started.
Worker node with IP: 192.168.1.3 successfully started.

---------------------------------
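
When these steps are scripted, the summaries printed by install.sh and startCluster.sh can be checked automatically. The sketch below extracts the number of failed nodes from a summary, assuming the "Failed results" line has exactly the format shown in the sample outputs above; the function name pdb_failed_count and the log file name are hypothetical.

```shell
# Hypothetical helper: read script output on standard input and print the
# number of failed nodes taken from the "*** Failed results (F/N) ***" line.
# Assumption: the line has exactly the format shown in the sample output.
pdb_failed_count() {
    sed -n 's/^\*\*\* Failed results (\([0-9][0-9]*\)\/[0-9][0-9]*) \*\*\*$/\1/p'
}

# Usage sketch:
#   "$PDB_HOME/scripts/startCluster.sh" distributed 192.168.1.0 conf/pdb-key.pem | tee start.log
#   [ "$(pdb_failed_count < start.log)" = "0" ] || echo "some worker nodes failed" >&2
```

If the summary line is missing or formatted differently, the function prints nothing, which the comparison above also treats as a failure.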

At this point a distributed version of PlinyCompute has been successfully launched!