PlinyCompute is declarative in the large, and native and tunable in the small. In particular, it gives advanced users such as tool developers a set of knobs for controlling memory management, resource utilization, and related behavior. This section describes those parameters in detail.
numThreads: This value should usually be set to the number of CPU cores on a worker. In our benchmarks, setting it to 1.5 times the number of CPU cores often gives the best results.
sharedMemSize: This value should usually be set to half of the available memory on a worker. You can increase it to fit your dataset size.
For workloads with join and aggregation computations, leave some memory for hash maps, which are allocated on the heap rather than in the shared memory pool.
You can configure the above parameters through the following scripts:
$PDB_HOME/scripts/startCluster.sh distributed masterIp keyFile numThreads sharedMemSize
$PDB_HOME/scripts/startWorkers.sh masterIp keyFile numThreads sharedMemSize
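As an illustrative starting point (the numbers, host address, and key-file path below are assumptions, not recommendations), a worker with 16 CPU cores and 64 GB of RAM might be started as follows. Note that the unit expected for sharedMemSize can differ between versions, so verify it against your installation:

```bash
# Illustrative sketch for a 16-core, 64 GB worker.
NUM_THREADS=24                  # 16 cores * 1.5, per the benchmark guidance above
SHARED_MEM_SIZE=34359738368     # half of 64 GB, expressed in bytes (verify the unit your scripts expect)
$PDB_HOME/scripts/startCluster.sh distributed 192.168.1.10 ~/.ssh/id_rsa $NUM_THREADS $SHARED_MEM_SIZE
```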
numPartitionRatio: By default, the total number of partitions for join and aggregation is assumed to be 1.5 times the number of CPU cores. You can change this ratio by setting the numPartitionRatio parameter when starting the pdb-manager:
$PDB_HOME/bin/pdb-manager hostname port pseudoModeOrNot pem_file numPartitionRatio
You also need to modify this parameter in the scripts that invoke pdb-manager, e.g., startCluster.sh and startWorkers.sh.
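For example, the following sketch (host name, port, and key-file path are placeholders, and the exact value expected for pseudoModeOrNot may vary by version) starts the manager with roughly two partitions per CPU core instead of the default 1.5:

```bash
# Illustrative only: the last argument raises the partition ratio to 2.0,
# i.e. roughly two join/aggregation partitions per CPU core.
# <pseudoModeOrNot> should be whatever your version expects for
# "not pseudo mode"; check the startup scripts for the exact value.
$PDB_HOME/bin/pdb-manager manager-node.example.com 8108 <pseudoModeOrNot> ~/.ssh/id_rsa 2.0
```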
There are also many parameters that you can configure through compiler flags:
Compiler Flag | Meaning
---|---
DEFAULT_BATCH_SIZE | The batch size for vectorized pipeline execution. (default: 100)
EVICT_STOP_THRESHOLD | The ratio of used shared memory at which cache eviction is triggered or stopped. (default: 0.95)
JOIN_HASH_TABLE_SIZE_RATIO | The space inflation ratio for building a join hash table for broadcasting: the size of the broadcast join hash table relative to the size of the data used to build it. (default: 1.5)
AUTO_TUNING | In Cluster Mode it is recommended to enable this flag: the system automatically detects each worker's memory and number of hardware threads and computes the heap size required for building hash maps. In Pseudo-Cluster Mode it is recommended to disable this flag and manually configure DEFAULT_MEM_SIZE, DEFAULT_NUM_CORES, and DEFAULT_HASH_PAGE_SIZE. (default: on)
DEFAULT_HASH_PAGE_SIZE | Memory size, in bytes, for building hash maps on the heap; only used if AUTO_TUNING is not set. (default: 536870912)
DEFAULT_MEM_SIZE | Available memory size, in bytes, on a worker; only used if AUTO_TUNING is not set. (default: 73014444032)
DEFAULT_NUM_CORES | Number of hardware threads available on a worker; only used if AUTO_TUNING is not set. (default: 8)
DEFAULT_PAGE_SIZE | Default page size, in bytes, for creating a set. (default: 268435456)
DEFAULT_MAX_PAGE_SIZE | Maximum page size, in bytes, allowed for creating a set, shuffling, or broadcasting. (default: 268435456)
DEFAULT_SHUFFLE_PAGE_SIZE | Default page size, in bytes, for shuffling. (default: 268435456)
DEFAULT_BROADCAST_PAGE_SIZE | Default page size, in bytes, for broadcasting. (default: 268435456)
DEFAULT_MAX_CONNECTIONS | Maximum number of concurrent workers. (default: 200)
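These are compile-time flags, so changes only take effect after PlinyCompute is rebuilt. As a hedged sketch, assuming the flags map to ordinary C++ preprocessor definitions and that your checkout is configured with CMake (the exact build entry point may differ), they could be passed like this:

```bash
# Illustrative only: define a few flags at configure time, then rebuild.
# The specific values are examples, not tuned recommendations.
cmake -DCMAKE_CXX_FLAGS="-DAUTO_TUNING -DDEFAULT_BATCH_SIZE=200 -DDEFAULT_SHUFFLE_PAGE_SIZE=134217728" ..
make -j
```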