The software on this page is provided on an as is basis for research purposes. There is no additional support offered, nor are the author(s) or their institutions liable under any circumstances.


ERA: Efficient Serial and Parallel Suffix Tree Construction​

We developed a disk-based suffix tree construction method, called Elastic Range (ERA), which works efficiently with very long strings that are much larger than the available memory. ERA partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERA also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERA indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.



Relevant Data Sets

Human Genome (2.6GB8MB) , Protein (32MB) and English text from Wikipedia (1GB)​

Relevant Publications

Essam Mansour, Amin Allam, Spiros Skiadopoulos, Panos Kalnis: ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings. PVLDB 5(1): 49-60 (2011)  (PDF) (BibTeX)


ACME: Efficient Parallel Motif Extraction for Very Long Sequences​

ACME is an advanced parallel motif extractor. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90\% speedup in parallel execution. ACME is the only method that: (i) scales to gigabyte-long sequences (e.g., entire human genome); (ii) scales to large alphabets (e.g., English alphabet for Wikipedia); (iii) supports interesting types of motifs (e.g., supermaximal motifs) with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud and supercomputers. Compared to the current state of the art, ACME reduces the extraction time for an exact-length query from almost 4 hours to 7 minutes on a typical workstation; can handle 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer..



Relevant Data Sets

Human Genome (2.6GB, 8MB) , Protein (32MB) and English text from Wikipedia​ (1GB)

Relevant Publications

Majed SahliEssam Mansour, Panos Kalnis: ACME: Efficient Parallel Motif Extraction from Very Long Sequences. Technical Report  (PDF​)




Mizan is an advanced clone to Google’s graph processing system Pregel that utilizes online graph vertex migrations to dynamically optimizes the execution of graph algorithms. You can use our Mizan system to develop any vertex centric graph algorithm and run in parallel over a local cluster or over cloud infrastructure. Mizan is compatible with Pregel’s API, written in C++ and uses MPICH2 for communication.

Download MIZAN

Try MIZAN on EC2

Mizan is published in EuroSys 13
Adaptive Parallelism for Scientific Applications
Scientific applications demand large computation and have data parallelism by nature. Parallel scientific applications are developed to leverage supercomputers and cloud infrastructures. The ideal amount of computing resources to use varies with queries workload and computing infrastructures.Furthermore, users may have limited budget or time constraints.This paper presents APlug, an adaptive parallelism framework for scientific applications. APlug estimates the degree of parallelism and dynamically recommends the number of cores considering workload per query, user constraints, and infrastructure. We utilize a Gamma probability density function to eliminate estimation errors due to extreme skewness inherent in the set of tasks. Our experiments show the viability of our framework for molecular docking, sequence alignment, and string mining under large-scale infrastructures; 16,384 CPUs in a supercomputer and 480 cores in a Linux cluster. APlug estimates the degree of parallelism in very short time with less than 10% error.​

GraMi is a novel framework for frequent subgraph mining in a single large graph, GraMi outperforms existing techniques by 2 orders of magnitudes, it supports finding frequent subgraphs as well as frequent patterns, Compared to subgraphs, patterns offer a more powerful version of matching that captures transitive interactions between graph nodes (like friend of a friend) which are very common in modern applications. Also, GraMi supports user-defined structural and semantic constraints over the results, as well as approximate results. ​


GraMi on Amazon EC2:
AMI ID: ami-90fc15f8 

For more information on how to launch an AMI on Amazon EC2, check the following link: 

Once you launch GraMi instance , you can connect to it using the following ssh command:
ssh [-i identity_file] [user@]hostname
Click here for more information on how to connect to an EC2 instance using SSH. 

After connecting successfully to the server, go into grami folder:
cd ./grami 

Finally, try the following example:
./grami -f citeseer.lg -s 160 -t 1 -p 1 -d 200 

Notice that as you use larger graphs, you will need large memory. Make sure that your EC2 instance has enough RAM, or try to increase java heap. Increase java heap space 

You can check the result by opening the output file: 'Output.txt'. The first line shows the elapsed time in seconds, the second line contains the number of frequent subgraphs, then there is the list of frequent subgraphs. 

Relevant Publications:
Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph. PVLDB, 7(7):517-528, 2014.

For more information about GraMi, check this link
For questions or comments contact Ehab Adbelhamid:
Pathogens infect their host and hijack the host machinery to produce more progeny pathogens. Obligate intracellular pathogens, in particular, require resources of the host to replicate. Therefore, infections by these pathogens lead to alterations in the metabolism of the host, shifting in favor of pathogen protein production. Some computational identification of mechanisms of host-pathogen interactions have been proposed, but it seems the problem has yet to be approached from the metabolite-hijacking angle.

We propose a novel computational framework, Hi-Jack, for inferring pathway-based interactions between a host and a pathogen that relies on the idea of metabolite hijacking. Hi-Jack searches metabolic network data from hosts and pathogens, and identifies candidate reactions where hijacking occurs. A novel scoring function ranks candidate hijacked reactions and identifies pathways in the host that interact with pathways in the pathogen, as well as the associated frequent hijacked metabolites. We also describe host-pathogen interaction principles that can be used in the future for subsequent studies. Our case study on Mycobacterium tuberculosis (Mtb) revealed pathways in human---e.g., carbohydrate metabolism, lipids metabolism, and pathways related to amino acids metabolism---that are likely to be hijacked by the pathogen. In addition, we report interesting potential pathway interconnections between human and Mtb such as linkage of human fatty acid biosynthesis with Mtb biosynthesis of unsaturated fatty acids, or linkage of human pentose phosphate pathway with lipopolysaccharide biosynthesis in Mtb.

Software Link
State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data pre-processing phase, leading to a high startup cost. Apriori knowledge of the query workload has also been used to create partitions, which however are static and do not adapt to workload changes.

In this paper, we propose AdPart, a distributed RDF system, which addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, that distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (i) starts faster than all existing systems; (ii) processes thousands of queries before other systems become online; and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds. ​

Note: all dependencies are detailed in the file.

The manuscript is available here AdPart (1).pdf​

Relevant publications:
1) Razen Harbi, Ibrahim Abdelaziz, Panos Kalnis, Nikos Mamoulis, Yasser Ebrahim, Majed Sahli. Adaptive Partitioning for Very Large RDF Data. CoRR abs/1505.02728 (2015)
2) Razen Harbi, Ibrahim Abdelaziz, Panos Kalnis, Nikos Mamoulis. Evaluating SPARQL Queries on Massive RDF Datasets. PVLDB 8(12): 1848-1859 (2015)​
A growing number of applications require combining SPARQL queries with generic graph search on RDF data. However, the lack of procedural capabilities in SPARQL makes it inappropriate for graph analytics. Moreover, RDF engines focus on SPARQL query evaluation whereas graph management frameworks perform only generic graph computations. In this work, we bridge the gap by introducing SPARTex, an RDF analytics framework based on the vertex-centric computation model. In SPARTex, user-defined vertex-centric programs can be invoked from SPARQL as stored procedures. SPARTex allows the execution of a pipeline of graph algorithms without the need for multiple reads/writes of input data and intermediate results. We use a cost-based optimizer for minimizing the communication cost. SPARTex evaluates queries that combine SPARQL and generic graph computations orders of magnitude faster than existing RDF engines.

coming soon

Relevant Publications:
Ibrahim Abdelaziz, Razen Harbi, Semih Salihoglu, Panos Kalnis and Nikos Mamoulis. SPARTex: A Vertex-Centric Framework for RDF Data Analytics. PVLDB 8(12): 1880-1891 (2015)​
Sequences and applications using them are proliferating in science and business. Currently, sequences are stored in file systems and processed using ad-hoc procedural code. Existing techniques are not flexible and cannot efficiently handle complex queries or large datasets. In this paper, we demonstrate StarDB, a distributed database system for analytics on sequences. StarDB hides data and system complexities and allows users to focus on analytics. It uses a comprehensive set of parallel string operations and provides a declarative query language to solve complex queries. StarDB automatically tunes itself and runs with over 90\% efficiency on supercomputers, public clouds, clusters, and workstations. We test StarDB using real datasets that are 2 orders of magnitude larger than the datasets reported by previous works.


Relevant publications:
Majed Sahli, Essam Mansour, Panos Kalnis. StarDB: A Large-Scale DBMS for Strings. PVLDB 8(12): 1844-1855 (2015)​
PEDAL: Discriminative identification of transcriptional responses of promoters and enhancers after stimulus

Promoters and enhancers regulate the initiation of gene expression and maintenance of expression levels in spatial and temporal manner. Recent findings stemming from the Cap Analysis of Gene Expression (CAGE) demonstrate that promoters and enhancers, based on their expression profiles after stimulus, belong to different transcription response subclasses. One of the most promising biological features that might explain the difference in transcriptional response between subclasses is the local chromatin environment. We introduce a novel computational framework, PEDAL, for distinguishing effectively transcriptional profiles of promoters and enhancers using solely histone modification marks, chromatin accessibility and binding sites of transcription factors and co-activators. A case study on data from MCF-7 cell-line reveals that PEDAL can identify successfully the transcription response subclasses of promoters and enhancers from two different stimulations. Moreover, we report subsets of input markers that discriminate with minimized classification error MCF-7 promoter and enhancer transcription response subclasses. Our work provides a general computational approach for identifying effectively cell-specific and stimulation-specific promoter and enhancer transcriptional profiles, and thus, contributes to improve our understanding of transcriptional activation in human.​

This is a collaborative research project between KAUST (prof Vladimir Bajic and prof Panos Kalnis) and RIKEN Yokohama labs (Dr Erik Arner).
Comments are welcome. Please email to if you have any enquiry.  

Supplementary Information