publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
An asterisk (*) indicates equal contribution to a scholarly work.
NGS & Computational pipelines
-
A broad survey of DNA sequence data simulation tools Alosaimi, Shatha, Bandiang, Armand, Biljon, Noelle, Awany, Denis, Thami, Prisca K, Tchamga, Milaine S S, Kiran, Anmol, Messaoud, Olfa, Hassan, Radia Ismaeel Mohammed, Mugo, Jacquiline, Ahmed, Azza, Bope, Christian D, Allali, Imane, Mazandu, Gaston K, Mulder, Nicola J, and Chimusa, Emile R Briefings in Functional Genomics 2019 [Abs] [HTML]
In silico DNA sequence generation is a powerful technology to evaluate and validate bioinformatics tools, and accordingly more than 35 DNA sequence simulation tools have been developed. With such a diverse array of tools to choose from, an important question is: which tool should be used for a desired outcome? This question is largely unanswered, as documentation for many of these DNA simulation tools is sparse. To address this, we reviewed the DNA sequence simulation tools developed to date and evaluated 20 state-of-the-art DNA sequence simulation tools on their ability to produce accurate reads based on their implemented sequence error model. We provide a succinct description of each tool and suggest which tool is most appropriate for different scenarios. Given the multitude of similar yet non-identical tools, researchers can use this review as a guide to inform their choice of DNA sequence simulation tool. This paves the way towards assessing existing tools in a unified framework, as well as enabling analysis of different simulation scenarios within the same framework.
-
Managing genomic variant calling workflows with Swift/T Ahmed, Azza E., Heldenbrand, Jacob, Asmann, Yan, Fadlelmola, Faisal M., Katz, Daniel S., Kendig, Katherine, Kendzior, Matthew C., Li, Tiffany, Ren, Yingxue, Rodriguez, Elliott, Weber, Matthew R., Wozniak, Justin M., Zermeno, Jennie, and Mainzer, Liudmila S. PLOS ONE 2019 [Abs] [bioRxiv] [HTML] [PDF]
Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the “best” workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance, and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures.
The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
-
SVCurator: A Crowdsourcing app to visualize evidence of structural variants for the human genome Chapman, Lesley M, Spies, Noah, Pai, Patrick, Lim, Chun Shen, Carroll, Andrew, Narzisi, Giuseppe, Watson, Christopher M., Proukakis, Christos, Clarke, Wayne E., Nariai, Naoki, Dawson, Eric, Jones, Garan, Blankenberg, Daniel, Brueffer, Christian, Xiao, Chunlin, Kolora, Sree Rohit Raj, Alexander, Noah, Wolujewicz, Paul, Ahmed, Azza, Smith, Graeme, Shehreen, Saadlee, Wenger, Aaron M., Salit, Marc, and Zook, Justin M. bioRxiv 2019 [Abs] [HTML] [PDF]
A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) has yet to be defined. In this study, we manually curated 1235 SVs which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app - SVCurator - to help curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. The crowdsourced results were highly concordant, with 37 out of the 61 curators having at least 78% concordance with a set of expert curators, where there was 93% concordance amongst expert curators. This produced high-confidence labels for 935 events. When compared to the heuristic-based draft benchmark SV callset from GIAB, the SVCurator crowdsourced labels were 94.5% concordant with the benchmark set. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
-
Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics Baichoo, Shakuntala*, Souilmi, Yassine*, Panji, Sumir*, Botha, Gerrit, Meintjes, Ayton, Hazelhurst, Scott, Bendou, Hocine, Beste, Eugene de, Mpangase, Phelelani T., Souiai, Oussema, Alghali, Mustafa, Yi, Long, O’Connor, Brian D., Crusoe, Michael, Armstrong, Don, Aron, Shaun, Joubert, Fourie, Ahmed, Azza E., Mbiyavanga, Mamana, Heusden, Peter van, Magosi, Lerato E., Zermeno, Jennie, Mainzer, Liudmila Sergeevna, Fadlelmola, Faisal M., Jongeneel, C. Victor, and Mulder, Nicola BMC Bioinformatics 2018 [Abs] [HTML] [PDF]
The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa (H3Africa) program, an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging. Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language (CWL) and two using Nextflow. All the workflows are containerized for improved portability and reproducibility using Docker, and are publicly available for use by members of the H3Africa consortium and the international research community.
Conclusion: The H3ABioNet workflows have been implemented with ease of use for the end user in mind, along with high levels of reproducibility and portability, all while following modern, state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity and assist genomics research within Africa and serve to increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.
Education & Scientific communications
-
Delivering blended bioinformatics training in resource-limited settings: a case study on the University of Khartoum H3ABioNet node Ahmed, Azza E, Awadallah, Ayah A, Tagelsir, Mawada, Suliman, Maram A, Eltigani, Atheer, Elsafi, Hassan, Hamdelnile, Basil D, Mukhtar, Mohamed A, and Fadlelmola, Faisal M Briefings in Bioinformatics 2019 [Abs] [bioRxiv] [HTML] [PDF]
Motivation: Delivering high-quality distance-based courses in resource-limited settings is a challenging task. Besides the needed infrastructure and expertise, effective delivery of a bioinformatics course could benefit from hands-on sessions, interactivity and problem-based learning approaches. Results: In this article, we discuss the challenges and best practices in delivering bioinformatics training in resource-limited settings taking the example of hosting and running a multiple-delivery online course, Introduction to Bioinformatics, that was developed by the H3ABioNet Education and Training working group and delivered in 27 remote classrooms across Africa in 2017. We take the case of the University of Khartoum classrooms. Believing that our local setting is similar to others in less-developed countries, we also reflect upon aspects like classroom environment and recruitment of students to maximize outcomes.
Keywords: bioinformatics training, blended learning, bMOOC, distance-based learning, resource-limited settings
-
Organizing and running bioinformatics hackathons within Africa: The H3ABioNet cloud computing experience Ahmed, Azza E*, Mpangase, Phelelani T*, Panji, Sumir, Baichoo, Shakuntala, Botha, Gerrit, Fadlelmola, Faisal M, Hazelhurst, Scott, Van Heusden, Peter, Jongeneel, C Victor, Joubert, Fourie, and others. AAS Open Research 2018 [Abs] [HTML] [PDF]
The need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Heredity and Health in Africa Consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous compute environments, and for the nurturing of technical expertise in workflow languages and containerization technologies. To address this need, in 2016 H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon, with the purpose of building key genomics analysis pipelines able to run on heterogeneous computing environments and meeting the needs of H3Africa research projects. This paper describes the preparations for this hackathon and reflects upon the lessons learned about its impact on building the technical and scientific expertise of African researchers. The workflows developed were made publicly available in GitHub repositories and deposited as container images on quay.io.
-
Development of Bioinformatics Infrastructure for Genomics Research Mulder, Nicola J, Adebiyi, Ezekiel, Adebiyi, Marion, Adeyemi, Seun, Ahmed, Azza, Ahmed, Rehab, Akanle, Bola, Alibi, Mohamed, Armstrong, Don L, Aron, Shaun, and others. Global Heart 2017 [Abs] [HTML] [PDF]
Background: Although pockets of bioinformatics excellence have developed in Africa, generally, large-scale genomic data analysis has been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet’s role has evolved in response to changing needs from the consortium and the African bioinformatics community. Objectives: H3ABioNet set out to develop core bioinformatics infrastructure and capacity for genomics research in various aspects of data collection, transfer, storage, and analysis. Methods and Results: Various resources have been developed to address genomic data management and analysis needs of H3Africa researchers and other scientific communities on the continent. NetMap was developed and used to build an accurate picture of network performance within Africa and between Africa and the rest of the world, and Globus Online has been rolled out to facilitate data transfer. A participant recruitment database was developed to monitor participant enrollment, and data is being harmonized through the use of ontologies and controlled vocabularies. The standardized metadata will be integrated to provide a search facility for H3Africa data and biospecimens. Because H3Africa projects are generating large-scale genomic data, facilities for analysis and interpretation are critical. H3ABioNet is implementing several data analysis platforms that provide a large range of bioinformatics tools or workflows, such as Galaxy, the Job Management System, and eBiokits. A set of reproducible, portable, and cloud-scalable pipelines to support the multiple H3Africa data types are also being developed and dockerized to enable execution on multiple computing infrastructures. 
In addition, new tools have been developed for analysis of the uniquely divergent African data and for downstream interpretation of prioritized variants. To provide support for these and other bioinformatics queries, an online bioinformatics helpdesk backed by broad consortium expertise has been established. Further support is provided by means of various modes of bioinformatics training. Conclusions: Over the past 4 years, the development of infrastructure support and human capacity through H3ABioNet has significantly contributed to the establishment of African scientific networks, data analysis facilities, and training programs. Here, we describe the infrastructure and how it has affected genomics and bioinformatics research in Africa.