Bioinformatics Programming Libraries

A list of programming libraries for Bioinformatics concentrating on their applicability to protein sequence and structure analysis. Please send me links to any I am missing or any updates to the information provided!

Atomium
Python Ireland and Martin, 2019 (in press) Maintained. Updated 2019
  • structure
  • structure retrieval
Atomium is a modern, lightweight Python library for parsing, manipulating and saving files in the PDB, mmCIF and MMTF formats. Its performance is equivalent to that of BioPython, but it has significant advantages in features and API design.
    
BALL
C++(Python) Hildebrandt et al., 2010 Not maintained. Updated 2014
  • structure
  • rendering
  • molecular mechanics
The Biochemical Algorithms Library (BALL) is a comprehensive rapid application development framework for structural bioinformatics. It provides a C++ class library of data structures and algorithms for molecular modelling and structural bioinformatics. The accompanying BALLView program allows rendering and visualization. The library provides a number of generally-useful routines (including handling of matrices, vectors and quaternions, 3D objects, Fast Fourier Transforms, string handling); structure file handling (including NMR formats); molecular mechanics (with force fields including AMBER, CHARMM and MMFF94); NMR spectra; QSAR; solvation; rotamers; docking.
    
Bio++
C++ Dutheil et al., 2006 Not maintained. Updated 2014
  • protein sequences
  • DNA sequences
  • phylogeny
  • evolution/population genetics
Bio++ is a set of C++ libraries for Bioinformatics, covering sequence analysis, phylogenetics, molecular evolution and population genetics, including likelihood computation. Programs based on Bio++ include BppSuite ( biopp.univ-montp2.fr/wiki/index.php/BppSuite, home.gna.org/bppsuite/, a suite of ready-to-use programs for phylogenetic and sequence analysis), Bpp Phyview ( bppphyview.sourcearchive.com/, a Qt-based visual interface for trees) and TestNH ( biopp.univ-montp2.fr/forge/testnh, a package for testing non-homogeneous processes in sequence evolution).
    
BioJava
Java Prlic et al., 2012 Maintained. Updated 2015
  • protein sequences
  • structure
  • CATH/SCOP
  • structure alignment
  • sequence alignment
  • external programs
BioJava provides a structure module which handles input and output of standard PDB and mmCIF files, and structural alignment using the CE and FATCAT algorithms. It provides a number of routines for handling geometry, symmetry and structure validation, as well as interfacing to JMol. It can also interface with BLAST, CATH and SCOP.
    
BioPerl
Perl Stajich et al., 2002 Maintained. Updated 2015
  • protein sequences
  • DNA sequences
  • phylogeny
  • taxonomy
  • external programs
  • references (PubMed, etc)
  • OMIM
  • contig analysis
  • sequence retrieval
BioPerl is an extensive project, but with limited support for protein structure. Its modules for handling PDB structures implement a PDB parser and data structure, but provide almost no routines for manipulating or analyzing the data. BioPerl can execute analyses and process results from programs such as BLAST, ClustalW, or the EMBOSS suite. It provides access to data stores such as GenBank and SwissProt as well as the storage format of the Open Bioinformatics Database Access project.
    
BioPython
Python Cock et al., 2009 Not maintained. Updated 2014
  • protein sequences
  • DNA sequences
  • structure
  • external programs
  • sequence retrieval
The Biopython project is a mature project for a wide range of bioinformatics problems. It includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macromolecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. The modules for handling PDB structures implement a PDB parser and data structure, but provide almost no routines for manipulating or analyzing the data. A BioPython Google Summer of Code project in 2010 added some more PDB handling code ( biopython.org/wiki/GSOC2010_Joao), but this is not in the main distribution version. Details of a PDB parser are described in a paper by Hamelryck and Manderick (2003, www.ncbi.nlm.nih.gov/pubmed/14630660).
    
BioRuby
Ruby Goto et al., 2010 Not maintained. Updated 2013
  • protein sequences
  • DNA sequences
  • structure
  • pathway analysis
  • external programs
  • protein modelling
  • phylogeny
  • sequence retrieval
  • references (PubMed, etc)
BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis. It supports various data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. It provides a tutorial, documentation and an interactive environment, which can be used in the shell and in the web browser. It provides a data structure for PDB data and a number of functions for finding records, but little for analysis.
    
Blast Toolkit
C  Not maintained. Updated 2013
  • sequence retrieval
  • sequence searching
  • sequence alignment
There is no publication that describes the toolkit itself, but an online book is available at www.ncbi.nlm.nih.gov/toolkit/doc/book/ or in PDF format at www.ncbi.nlm.nih.gov/toolkit/doc/book/pdf/TOC.pdf.
The code provides a number of general purpose routines including: networking and interprocess communication (IPC); multi-threading; CGI/Fast-CGI; HTML Generation; BerkeleyDB and SQL database access; IOSTREAMs; GZIP/BZ2 compression; ASN.1 and XML serialization; date and time; file system access; portable graphics libraries; XML parsing and handling.
Bioinformatics routines include: sequence alignment; the BLAST engine; sequence retrieval and processing.
    
BTL
C++ Pitt et al., 2001 Not maintained. Updated 2005
  • structure
  • image processing/manipulation
Provides classes for reading PIR sequence files, PDB files, raw co-ordinate files and PPM image files; matrix and vector classes for handling coordinates; graph classes for handling vertices and labelled edges with iteration functions; sorting; calculation and extraction of properties associated with amino acids; random numbers; number comparison.
    
DSR-PDB
C++  Not maintained. Updated 2005
  • structure
A simple C++ PDB reader along with a couple of programs which use it to manipulate PDB files (applying a rigid transform or splitting/merging). It allows easy access to the geometry and bond structure in addition to the biological information. The reader has two modes for reading/writing a PDB file. The simplest one, through the Protein class, just reads and writes a single protein from/to a PDB file (which must have only one chain, but can have multiple models). The second, through the PDB class, can handle PDB files with multiple models. Once a PDB is read, atom coordinates can be extracted, proteins can be aligned, and cRMS and dRMS can be computed, among other things.
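To make the two measures mentioned above concrete, here is a minimal Python sketch. It assumes cRMS and dRMS carry their usual meanings (coordinate RMS deviation and distance RMS deviation), and that the two coordinate lists are equal in length, matched atom for atom and, for cRMS, already superposed. The function names crms and drms are my own and are not taken from DSR-PDB.

```python
import math
from itertools import combinations


def crms(coords_a, coords_b):
    """Coordinate RMS deviation between two matched, pre-superposed atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def drms(coords_a, coords_b):
    """Distance RMS deviation: compares the internal pairwise distances of
    the two structures, so no superposition is required."""
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two equal-length lists of at least two atoms")
    pairs = list(combinations(range(n), 2))
    total = sum(
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))
```

Because dRMS depends only on internal distances, a rigid translation or rotation of one structure leaves it at zero while cRMS changes, which is why cRMS is normally computed after an alignment step.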
    
EMBOSS
C Rice et al., 2000 Not maintained. Updated 2013
  • protein sequences
  • sequence alignment
  • enzyme kinetics
  • DNA sequences
  • phylogeny
  • sequence searching
  • trans-membrane prediction
This is a very extensive library and set of tools which focuses on sequence analysis. The STRUCTURE add-on package ( emboss.sourceforge.net/apps/release/6.6/embassy/structure/ and emboss.sourceforge.net/apps/release/6.6/emboss/apps/protein_3d_structure_group.html) provides a small number of tools for manipulating protein structure, but these are very limited in scope and flexibility.
    
ESBTL
C++ Loriot et al., 2010 Not maintained. Updated 2013
  • structure
Easy Structural Biology Template Library (ESBTL) is a lightweight C++ library that allows the handling of PDB data and provides a data structure suitable for geometric constructions and analyses. The parser and data model provided by this ready-to-use include-only library allows adequate treatment of usually discarded information (insertion code, atom occupancy, etc.) while still being able to detect badly formatted files. The template-based structure allows rapid design of new computational structural biology applications and is fully compatible with the new remediated PDB archive format. It also allows the code to be easy-to-use while being versatile enough to allow advanced user developments.
    
GeCo++
C++ Cereda et al., 2011 Not maintained. Updated 2012
  • DNA sequences
  • genomes/genomics
  • genome annotation
Designed for DNA sequence analysis where genomic annotations and variations need to be considered. Links annotations of genomic elements with sequences and algorithm results. Memory and time overheads have been minimized.
    
Gemmi
C++(Python and FORTRAN2003)  Maintained. Updated 2018
  • structure
Gemmi is a 'next generation' macromolecular coordinate library, funded by Global Phasing and CCP4 as a replacement for MMDB in applications such as Refmac, COOT and BUSTER. It currently has Python bindings with FORTRAN2003+ bindings soon to be available and C and JavaScript bindings planned.
    
GenomeTools
C(Python and Ruby) Gremme et al., 2013 Maintained. Updated 2015
  • genomes/genomics
  • genome annotation
A software library and associated software tools for creating, processing or converting annotation graphs, optimized for handling even the largest annotation sets, such as a complete catalogue of human variations. Designed to allow convenient extension and integration into larger workflows.
    
libcov
C++ Butt et al., 2005 Not maintained
  • phylogeny
  • structure
  • sequence alignment
Provides basic handling of protein structure and protein sequence alignments, but concentrates on phylogeny using maximum likelihood methods. Methods are provided to read PDB files, FASTA and PHYLIP sequence files and NEWICK trees. There are a number of methods for tree manipulation, phylogeny, maximum likelihood and sequence simulation. Methods for handling protein structure include geometric transformation and distance/contact matrices. While the link to the software is broken, the lab's web site is at web.cs.dal.ca/~cblouin/labblouin/.
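As an illustration of what a distance/contact matrix is, the following Python sketch builds both from a list of (x, y, z) coordinates, for example one C-alpha atom per residue. It is not libcov code: the function names and the 8.0 Angstrom default cutoff are assumptions chosen for the example.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_matrix(coords, cutoff=8.0):
    """Return a boolean matrix: True where two points lie within `cutoff`.

    The cutoff default is an assumed, commonly used value in Angstroms.
    """
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]
```

The diagonal is always True here because each point is at distance zero from itself; callers interested only in contacts between different residues would need to skip it.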
    
LibSequence
C++ Thornton, 2003 Maintained. Updated 2015
  • evolution/population genetics
  • DNA sequences
  • genomes/genomics
The library implements software for genomics and sequence polymorphism analysis including methods for data manipulation and calculation of statistics commonly used in analysis of SNP data. See the author's web site at www.molpopgen.org.
    
MAT (Macromolecular Analysis Toolkit)
C++  Maintained. Updated 2016
  • structure
Designed as an efficient parser for PDB files, it concentrates on the REMARK records, extracting information from their almost free text. PDB files are 'remodelled' to add useful information and remove redundancy. Applications are provided to analyze the structural and functional aspects of the biological structures. Further details at mat.iitr.ac.in/documentation.html.
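The kind of REMARK mining described above can be sketched in a few lines of Python. This is not MAT's implementation: the function names are hypothetical, and the code assumes the standard fixed-column PDB layout (remark number in columns 8-10, free text from column 12) and the conventional 'RESOLUTION. n.nn ANGSTROMS.' wording of REMARK 2.

```python
import re
from collections import defaultdict


def group_remarks(pdb_lines):
    """Group the free text of REMARK records by remark number."""
    remarks = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("REMARK"):
            continue
        number = line[6:10].strip()   # columns 7-10 hold the remark number
        text = line[11:].rstrip()     # free text starts at column 12
        if number.isdigit() and text:
            remarks[int(number)].append(text)
    return dict(remarks)


def resolution(remarks):
    """Return the resolution in Angstroms from REMARK 2, or None if absent."""
    for text in remarks.get(2, []):
        match = re.search(r"RESOLUTION\.\s+([0-9.]+)\s+ANGSTROMS", text)
        if match:
            return float(match.group(1))
    return None
```

Entries without a numeric resolution (for example NMR structures, where REMARK 2 reads 'NOT APPLICABLE') simply yield None.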
    
MMDB
C++(FORTRAN and Python)  Not maintained. Updated 2013
  • structure
MMDB is a macromolecular coordinate library, supporting CCP4 applications such as REFMAC and COOT. The 'CCP4 Coordinate Library Project' ( www.ebi.ac.uk/pdbe/docs/cldoc/) has built on this to provide an RWBROOK-compatible interface in FORTRAN and C.
    
MolTalk
Smalltalk Diemand and Scheib, 2004  
  • structure
MolTalk is an elaborate programming language, which consists of the programming library libmoltalk implemented in Objective-C and the Smalltalk-based interpreter MolTalk. MolTalk combines the advantages of easy-to-learn, programmable procedural scripting with the flexibility and power of a full programming language. One MolTalk application is PDBChainSaw, a MolTalk-based parser and information extraction utility for PDB files. Weekly updates of the PDB are synchronised with PDBChainSaw and are available for free download from the MolTalk project page www.moltalk.org following the link to PDBChainSaw. For each chain in a protein structure, PDBChainSaw extracts the sequence from its co-ordinates and provides additional information from the PDB-file header section, such as the organism's scientific name, compound name, and EC code.
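Extracting a sequence from co-ordinates, as described above, amounts to walking the ATOM records and emitting one letter per residue. The Python sketch below shows the idea; it is not PDBChainSaw's method. The function name is hypothetical, only the first model is read, HETATM records are ignored (so modified residues are missed), and anything outside the 20 standard amino acids becomes 'X'.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines):
    """Return {chain ID: one-letter sequence} built from ATOM records."""
    sequences = {}
    seen = set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        chain = line[21]                      # column 22: chain identifier
        residue_id = (chain, line[22:27])     # residue number + insertion code
        if residue_id in seen:
            continue                          # already counted this residue
        seen.add(residue_id)
        letter = THREE_TO_ONE.get(line[17:20], "X")  # columns 18-20: residue name
        sequences[chain] = sequences.get(chain, "") + letter
    return sequences
```

Note that a sequence obtained this way reflects only residues with observed co-ordinates, so it can be shorter than the sequence given in the SEQRES records.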
    
OpenStructure
C++(Python) Biasini et al., 2013 Maintained. Updated 2015
  • structure
  • protein sequences
  • workflows
  • rendering
  • image processing/manipulation
The OpenStructure software framework is designed to allow the seamless integration of information of different origins. It provides a graphics module for interactive display of molecular structures and density maps in three dimensions. It supports mmCIF from V1.4. It handles proteins, nucleic acids and ligands, density maps (and image processing), sequence data, batch processing to automate workflows and a graphical user interface to help develop algorithms as well as structure rendering. At the time of writing, V1.4 can only be obtained via git and is downloaded as described at www.openstructure.org/docs/1.4/install/#getting-the-source-code.
    
PDBlib (Arkq)
C  Not maintained. Updated 2011
  • structure
PDBlib provides basic functionality for accessing PDB files and supports the PQR format (extended PDB file for storing charges and radii of particular atoms). A couple of programs exploiting the library are also provided. No paper appears to have been published.
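For readers unfamiliar with PQR, the sketch below parses one atom line in Python. It assumes the common whitespace-delimited layout (record type, serial, atom name, residue name, optional chain ID, residue number, x, y, z, charge, radius) and a purely numeric residue number. The PqrAtom type and parse_pqr_line function are hypothetical names and do not reflect this library's C API.

```python
from typing import NamedTuple, Optional


class PqrAtom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str            # empty string when the file omits the chain ID
    residue_number: int
    x: float
    y: float
    z: float
    charge: float
    radius: float


def parse_pqr_line(line: str) -> Optional[PqrAtom]:
    """Parse one whitespace-delimited PQR atom line; None for other records."""
    fields = line.split()
    if not fields or fields[0] not in ("ATOM", "HETATM"):
        return None
    if len(fields) == 11:       # chain ID present
        chain, rest = fields[4], fields[5:]
    elif len(fields) == 10:     # chain ID omitted
        chain, rest = "", fields[4:]
    else:
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    x, y, z, charge, radius = (float(value) for value in rest[1:])
    return PqrAtom(int(fields[1]), fields[2], fields[3], chain,
                   int(rest[0]), x, y, z, charge, radius)
```

Splitting on whitespace rather than fixed columns is the main practical difference from PDB parsing, and it is why adjacent fields that run together in a fixed-column file would break this sketch.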
    
PDBlib (Chang)
C++ Chang et al., 1994 Not maintained. Updated 1998
  • structure
PDBlib is an extensible object-oriented library for representing the 3D structure of biological macromolecules and comes with two sample applications: PDBtool for structure verification and PDBview for structure rendering. There are four categories of classes: (i) classes that model the macromolecule; (ii) classes that enhance the extensibility of the library; (iii) classes that provide navigation facilities of the object-oriented macromolecular structure representation; and (iv) a class that loads a PDB file into the memory-resident object-oriented representation.
    
PDBlib.py
Python  Not maintained. Updated 2014
  • structure
A library to handle PDB and PDBx/mmCIF files. This is part of PBxplore ( github.com/pierrepo/PBxplore), a suite of tools for 'Protein Block' analysis. The library itself doesn't have a publication but is used in a paper by de Brevern et al., 2000 ( www.ncbi.nlm.nih.gov/pubmed/11025540).
    
Protein Library (PL)
C++  Not maintained. Updated 2007
  • structure
  • structure optimization
Designed as part of OPS (the Open Protein Simulator), for performing protein folding simulations. The 'Protein Library' part of the distribution is available from sourceforge.net/projects/protlib/files/pl/, but hasn't been updated since 2007 despite other parts having been updated in 2013. A small number of tools is also provided at sourceforge.net/projects/protlib/files/pl-tools/. The code provides routines for reading PDB data and optimizing structures by energy minimization. No citation is available for the library, but a list of publications that exploit the code is available at protlib.uchicago.edu/pubs.html.
    
RWBROOK
FORTRAN   
  • structure
RWBROOK is part of the CCP4 crystallography project. From CCP4 5.0, co-ordinates are handled by MMDB which provides Fortran bindings that are backwardly-compatible with the previous Fortran-only version of rwbrook.
    
Sleipnir
C++ Huttenhower et al., 2008 Maintained. Updated 2015
  • genomes/genomics
  • microarrays
  • machine learning
Sleipnir focuses on genomes, microarrays and machine learning. It provides routines and tools for microarray processing (imputing, combining, clustering, calculating normalized correlations, etc.), exploration of functional catalogues, Bayesian data integration, graph generation and clique identification, and rapid data mining.
    
Victor
C++ Hirsh et al., 2015 Not maintained. Updated 2014
  • protein sequences
  • structure
  • statistical potentials
  • sequence alignment
  • loop modelling
The VIrtual Construction TOol for pRoteins (Victor) C++ library is designed to be easy to use. Application examples cover statistical energy potentials, profile-profile sequence alignments and ab initio loop modeling.