KabatMan -------- Version 2.25 24th August, 2006 Copyright (c) 1994-2006 Dr. Andrew C.R. Martin / UCL / Reading, Biomolecular Structure & Modelling Unit, Department of Biochemistry and Molecular Biology, University College, Gower Street, London. WC1E 6BT. Document created : 27th April, 1994 1. Introduction =============== KabatMan is a simple database program for accessing the Kabat antibody sequence database. Kabat, et al. have, for some years, published collected antibody sequence data in the form of a book, but as the number of sequences solved increases, the need for electronic distribution becomes more and more necessary. The Kabat data have now been made available as `dump' files by FTP from the following site: ncbi.nlm.nih.gov:repository/kabat/dump As of V2.0, KabatMan now reads the new format introduced on 2nd April and first made available during June, 1994. The old format files may still be read by using the -o flag. In January 1995, the new format changed somewhat and V2.4 reads this new format (and not the June 1994 format). The new format files are available from: ftp://ncbi.nlm.nih.gov/repository/kabat/fixlen or ftp://ftp.ebi.ac.uk/pub/databases/kabat/fixlen The format of these data files, while easy to read by eye, is somewhat difficult to scan by computer. Currently, no software is supplied to derive data from the files, though it is the intention of the Kabat database maintainers to provide a more computer-friendly environment in the future. KabatMan (``Kabat Manager'') has been written to fill the present gap. It assembles the antibody data from the Kabat dump files and allows searches to be performed on the data. Standard PIR sequence files may be created and other information (such as CDR lengths) may be obtained while allowing many constraints to be placed on the selection of sequences. 2. Command Syntax ================= KabatMan is normally invoked simply with the command `kabatman'. Interaction is then interactive. Three options allow additional control over the program: kabatman [-version] [-f] [-v[v...]] [-q] [-o] (Square brackets indicate optional items; you don't type them!) The -version (or --version) flag causes KabatMan to print its copyright/version message and exit. The -f flag forces the reading of new raw Kabat data files even if the kabat.dat file exists (see Section 3) The -v flag sets the information level to 2 or more (default is 1). Each additional v increments the information level by one. A level of 2 causes the program to display the name of each sequence as it is stored. The -q flag sets the information level to 0 (default is 1). This prevents the program from displaying the names of Kabat sequence files as they are processed. The -o flag causes the program to read old Kabat format files. If you do not have the data stored in the current directory, you should first define the environment variable KABATDIR to point to the directory containing the data. For example, using a C-shell: setenv KABATDIR /usr/user/martin/kabat or using a Bourne-like shell: export KABATDIR=/usr/user/martin/kabat Whenever a file is opened, the program will look first in the current directory and then in the directory specified by KABATDIR. To leave the program, type QUIT or EXIT at the prompt. 3. The Data =========== Input data for KabatMan are the Kabat dump files. Either the old format or the new format (available after June 1994) may be used. By default, the new format is read. The new format files unlike the old format dump files are not separated into different species, protein type and chain types. It is necessary to split the files up first in order for KabatMan to make associations between appropriate light and heavy chains. The program SplitKabat performs this function; it will read the new dump files and write them out into separate files with names like human.ig.hc (as used for the old format dump files). Any non-antibody sequences (T-cell receptors, etc.) will be rejected during this procedure. As of the 1995 release of the Kabat database, each entry in the database comes in a separate file. The database is supplied as a set of compressed tar files which unpack into a set of directories: 000, 001, 002, 003, etc. containing files named by their KADBID entry code (000001, 000002, etc.) The directory names reflect the first 3 digits of the KADBID code. All files must be merged into a single datafile for use by KabatMan. This is achieved using a shell script of the form: touch kabat.jan95.dat for dir in 0* do cat $dir/* >>kabat.jan95.dat done This must be run from the directory containing all the Kabat sub-directories (000, 001, 002, etc.) and concatenates the data into a file called kabat.jan95.dat (You should change the date as required). You should ensure that there are no other files or directory starting with the digit 0 and, when the database increases to include KADBID entries starting with a 1, 2, 3, etc rather than just a 0 the script should be suitably modified. In addition, the Kabat database may be distributed with a `temp' directory containing incremental additions and corrections to the database. Again this is distributed as a compressed tar file (temp.tar.Z) which unpacks to create a subdirectory called temp. These files should be concatenated into a single file, kabat.patch, which is applied to the main kabat.jan95.dat file created above using the program PatchKabat. This program may also be used to delete entries from the main kabat.jan95.dat file. The KADBID numbers of entries to be deleted should be placed into a file, one to a line. PatchKabat is then run using: patchkabat [-d delfile] [-p patchfile] kabatdatabasefile >patchedfile -d Specify file of deletions (entry numbers, one to a line) -p Specify patch file Either (or both) -d or -p must be specified. For example, to apply the additions and corrections in file kabat.patch to the database kabat.jan95.dat and delete the entries in kabat.del, you would use the command line: patchkabat -d kabat.del -p kabat.patch kabat.jan95.dat >kabat.jan95.p1.dat Having prepared and merged the data, it must be split into separate files for specifes and chain using splitkabat. N.B. In common with KabatMan V2.4 itself, splitkabat V1.2 has been modified to handle the 1995 Kabat database release. The program is run by typing: splitkabat kabat1 This is repeated for kabat2, kabat3, etc. Data from each of the Kabat files will be appended to the appropriate class file. Also requried is a `file of files' which lists the Kabat dump files to be read. The file of files has any number of lines of the format: { heavy | - } [light [light [light] ] ] where heavy and light are names of matching heavy-chain and light-chain Kabat dump files. (Square brackets indicate optional items; you don't type them!) If a light-chain dump file has no matching heavy chain dump file, the heavy chain file must be specified as `-'. The following example file may be used for the Kabat dump files available at the time of writing: cat.ig.hc chicken.ig.hc chicken.ig.lambda dog.ig.hc frog.ig.hc gopher.ig.hc - horse.ig.lambda human.ig.hc human.ig.kappa human.ig.lambda mouse.ig.hc mouse.ig.kappa mouse.ig.lambda rabbit.ig.hc rabbit.ig.kappa rabbit.ig.lambda rat.ig.hc rat.ig.kappa rat.ig.lambda shark.ig.hc - sheep.ig.lambda various.ig.hc various.ig.kappa various.ig.lambda When the program is run by typing the command `kabatman', the data contained in these files are processed, with correlations being made between heavy and light chains of the same name and the resulting data are written to the file `kabat.dat' in the current directory. If the file `kabat.dat' exists in the current directory, or in the directory described by the environment variable KABATDIR, the file of files and the Kabat dump files will be ignored. The -f flag on the command line forces reading of the file of files and the generation of `kabat.dat' from the Kabat dump files. For the Chothia canonical information to be available, there must also be a canonicals definition file in the current directory or in the directory described by the environment variable KABATDIR. This file must be called chothia.dat and has the format: LOOP ... Where is L1...H3 is the class label (1, 2, 1', etc) is the loop length (AbM definition) is a residue specification (e.g. L25) is a list of allowed amino acid types (e.g. SYF) Residue numbering is Kabat standard unless the keyword CHOTHIANUMBERING appears at the beginning of the file (before any LOOP keywords). By convention, class names not defined by Chothia (i.e. additional classes defined on the basis of new crystal structures) are followed by a * (e.g. Class 9* for CDR-L3). Comments may be introduced with an exclamation mark (!) or a hash (#). These characters must appear in the first column of the file. The file may also contain one SOURCE record per loop record. This should follow the LOOP record and contains free-format text describing the definition of the canonical. This information is not used by the KabatMan program. 4. The Database Model ===================== The database uses a simple functional data model (FDM). As well as allowing data stored directly in the database tables to be accessed, an FDM database allows derived data to be accessed. This reduces the necessary storage size allowing information to be calculated `on the fly'. For example, the CDRs of the antibodies may be extracted using the KabatMan database. They are not stored explicitly as separate columns in the database as would be necessary using a normal flat file or relational database. Instead a function is stored which allows the CDRs to be derived from the sequence data stored for the whole antibody Fv. Similarly, the lengths of the CDRs are not stored directly in the database tables; they are calculated using a function in the database. Clauses in the WHERE statement which filter the data are combined in reverse polish notation. Selections resulting from each clause are placed on a stack and the logical operators act on this stack to combine the results. No attempt is made to optimise the searches (i.e. successive components which are ANDed together only need to check the items found in the previous hit list, but this is not done). Although this incurs a time penalty, the current database takes no more than a few seconds to search and the lack of optimisation simplifies the code. 5. The Query Language ===================== Data in the database is accessed using a SQL-like query language. The main deviation between the language used here and SQL is that clauses within the WHERE statement are combined in reverse polish notation. This is described in detail below. Also, since there is only one table in the database, there is no FROM statement. The query language uses three statements: SET, SELECT and WHERE. The SELECT and WHERE commands take you into a mode where clauses are specified with no introductionary SELECT or WHERE keyword; the prompt changes from `KABATMAN>' to indicate the current mode. Conversely, the SET command needs to be specified on each line where variables are to be set and may only be given at the `KABATMAN>' prompt. The SELECT and WHERE statements remain valid until a new statement is given. This allows multiple SELECT statements (each followed by a search execution) to obtain different data from a single WHERE statement thus saving on typing. Similarly, a number of WHERE statements may be tried to obtain the same data under different search conditions. The QUIT or EXIT command is used to leave the program. 5.1 The SET Statement --------------------- This allows variables to be set. These are used to control the operation of the program and options set using this keyword remain valid until the program is terminated or another SET command is given. The SET keyword must be given at the start of each line containing variables to be set and may only be given at the `KABATMAN>' prompt. More than one variable/value pair may be specified on the line. The syntax is thus: SET variable value [variable value ...] (Square brackets indicate optional items; you don't type them!) The following variables may be set: Variable Values Notes -------- ------ ----- LOOPS KABAT, ABM, CHOTHIA, Sets the definitions to use for CONTACT the CDRs INFO integer Sets the information level for messages INSERTS ON, OFF Specifies whether CDRs should be displayed with or without alignment dashes VARIABILITY real (0...100) Specifies a sequence identity threshold (see below). HTML ON, OFF Adds some HTML codes to the output to make the results look nicer over the web URL string Specify the URL CANONICAL string Specify the canonical class definition file DELIMiter chararacter Specify the field delimiter character The VARIABILITY variable allows one to specify that only sequences with less than a specified sequence homology be displayed. After all WHERE sub-clauses have been executed, KabatMan will do a pairwise comparison of all the remaining antibodies and reject one from each pair which has a percentage sequence identity greater than this value. For each sequence mismatch, a penalty of 100/MeanLength % is scored; missing chains do not score a penalty; deletions in one sequence with respect to the other are given a double penalty (i.e. 200/MeanLength). By default, VARIABILITY is set to 0.0 which is used to indicate `off' (this has the same effect as setting it to 100, but the program will run faster since the variability code will not be called). N.B. This option is SLOW if there are more than a couple of hundred hits to be processed. For example, with 1124 hits (using the query SELECT light WHERE complete = t;), takes 4 minutes on a 100MHz R4000 SG Indy. The option to specify the URL allows you to specfiy a different URL string to be used when selecting the URLL or URLH fields. You may also give the command SET URL DEFAULT to restore the default URL. If you use this command, you must specify the format in the correct fashion. It must start with href= and end with the HTML GET method query. The actual accession code and anchor tags will be appended to the string. The default string is: href=http://immuno.bme.nwu.edu/scripts/noninter.tcl?qt= If this doesn't mean anything to you, then DO NOT use this option! The SET CANONICAL command allows the canonical definitions (key residue) file to be specified. The default file (which is read on startup and may be re-read by giving the command SET CANONICAL DEFAULT) is called `chothia.dat'. Any string which you specify with the SET CANONICAL command (other than DEFAULT) will be appended with a '.' onto the name `chothia.dat' and this file will be read. For example, specifying SET CANONICAL STRICT will cause the program to read definitions from a file called `chothia.dat.strict' The SET DEMILITER command is fairly self-explanatory. The default is that output fields are separated by a comma and a space. However, some of the names contain commas so one can change to any other character (e.g. |) as a delimiter. A space is automatically appended to the delimiter. 5.2 The SELECT Statement ------------------------ Specifies what data should be extracted from the database. Each field may be separated by a space and/or a comma. Parameters are placed in parentheses. In the case of the PIR command, the parameter is optional; if not specified, output is to the standard output. The select statement may extend over more than one line. The following fields may be displayed: Property Notes -------- ------------------------- NAME Name of antibody ANTIGEN Antigen (or blank string) L1 Sequence of CDR L1 L2 Sequence of CDR L2 L3 Sequence of CDR L3 H1 Sequence of CDR H1 H2 Sequence of CDR H2 H3 Sequence of CDR H3 CLASS Light chain class (derived from file or filename) Note that there may also be heavy chain information here (e.g. IGG-KAPPA) SOURCE Animal source (derived from filename) REFerence Publication reference LENgth(loop) Length of specified CDR RESidue(pos) Amino acid at specified residue PIRfile[(file)] Create a PIR file (of given name if specified) LIGHT The light chain sequence HEAVY The heavy chain sequence CANonical(loop) The Chothia canonical class for a loop IDLight The accession code for the light chain IDHeavy The accession code for the heavy chain URLLight An HTML hypertext link to the light chain raw data URLHeavy An HTML hypertext link to the heavy chain raw data SUBGroup(chain) The subgroup (currently only for human sequences) DATe The year of the earliest reference LFR1 Sequence of light framework 1 LFR2 Sequence of light framework 2 LFR3 Sequence of light framework 3 LFR4 Sequence of light framework 4 HFR1 Sequence of heavy framework 1 HFR2 Sequence of heavy framework 2 HFR3 Sequence of heavy framework 3 HFR4 Sequence of heavy framework 4 SEQuence[(file)] Create a numbered sequence file (of given name if specified) (The required parts of field names are in capital letters) 5.3 The WHERE Statement ----------------------- Specifies filters to be applied to the data in order to select a subset which is to be displayed. Each clause is specified in conventional order notation using the form: property test value The following properties are supported: Property Type Notes -------- ---- ----- NAME string The antibody name ANTIGEN string The antigen (if specified) L1 string Sequence of CDR L1 L2 string Sequence of CDR L2 L3 string Sequence of CDR L3 H1 string Sequence of CDR H1 H2 string Sequence of CDR H2 H3 string Sequence of CDR H3 CLASS string Light chain class (derived from file or name) Note that there may also be heavy chain information here (e.g. IGG-KAPPA) SOURCE string Animal source (derived from filename) REFerence string Publication reference LENgth(loop) integer Length of specified CDR RESidue(pos) char Amino acid at specified position COMPLETE bool True of both chains are present VARiability integer Not implemented LIGHT string The light chain sequence HEAVY string The heavy chain sequence CANonical(loop) string Chothia class name or ? if unmatched IDLight nnnnnn The accession code for the light chain IDHeavy nnnnnn The accession code for the heavy chain SUBGroup(chain) string The subgroup (currently only for human sequences) DATe The year of the earliest reference LFR1 Sequence of light framework 1 LFR2 Sequence of light framework 2 LFR3 Sequence of light framework 3 LFR4 Sequence of light framework 4 HFR1 Sequence of heavy framework 1 HFR2 Sequence of heavy framework 2 HFR3 Sequence of heavy framework 3 HFR4 Sequence of heavy framework 4 (The required parts of field names are in capital letters) nnnnnn represents a 6-digit accession code. The following tests may be made: Test Property type Notes ---- ------------- ----- =, ==, eq all Exact match (NOTE 1) !=, <>, ne all Mismatch <, lt integer Less than <=, le integer Less than or equal >, gt integer Greater than >=, ge integer Greater than or equal cont,like,sim,inc,sub string Sub-string match (NOTES 2,3) NOTE 1: The exact match is not recommended for use with strings; it is much safer to use the substring match. NOTE 2: These test words are all equivalent and are the minimum string required. Thus `contains' may be used in place of `cont'; `includes' in place of `inc', etc. NOTE 3: Sub-string matches on the CDRs, LIGHT and HEAVY properties ignore dashes in the sequence allowing one to search for portions of sequence while ignoring gaps left for insertions. String values do not need to be placed in inverted commas unless they include white space or inverted commas. Thus one can use the string "HYHEL-5'CL" to include a single inverted comma or 'a "name"' to include double inverted commas. Blank strings may be specified using '' or "". You are recommended NOT to use exact matches for strings; use the sub-string match operator. Sub-string matches may contain ? wildcard characters to match single characters. This is particularly useful for sequence matches, so you can, for example, search for 10 residue L1s containing S?S??????Y. Boolean values are entered as `true' or `false' (only the first letter, t or f, is actually considered). All comparisons ignore case. WHERE clauses are combined using Reverse Polish Notation. Each clause generates a set which matches the specified test and places the set on a stack. The logical operators AND, OR and NOT cause operations to be performed on this stack of sets. Stack Operator Stack Notes ----- -------- ----- ----- 1 2 AND 1 Performs a logical AND 1 2 OR 1 Performs a logical OR 1 NOT 1 Performs a logical NOT The stack has a maximum depth of 10 items. This should be more than enough for all likely queries providing that the logical operators are not all placed at the end of the statement. (Should the stack depth prove insufficient, the constant STACKDEPTH in kabatman.h may be increased at compile time.) 5.4 Running The Search ---------------------- The search is run by typing a . or ; on a line by itself. These cause output to be directed to the screen. Alternatively, output my be sent to a file by typing a greater than sign followed by a filename (a space may optionally be placed between the > and the filename). Output from the search will then be sent to this file rather than to standard output. Note that the PIR option in the SELECT statement also allows a file to be specified for output. 5.5 Examples ------------ 1. Find all complete antibodies where the antigen is known with loop lengths: SELECT name,antigen,length(l1),length(l2),length(l3), length(h1),length(h2),length(h3) WHERE antigen != '' complete = true and ; 2. Get the sequences of all complete mouse antibodies which bind to lysozyme, placing the results in a file AbLysozyme.pir: SELECT pir(AbLysozyme.pir) WHERE source includes mouse antigen includes lysozyme AND complete = true AND ; 3. Find all antibodies with 11 residue CDR-L1s and a proline at the sixth position: SELECT name, l1 WHERE len(l1) = 11 res(L29) = P and ; 4. Find all complete antibodies with the sequence Ser-Ala-Ser-Ser-Ser in the light chain. (Note that there must be no spaces in the sequence: SELECT name, light WHERE complete = t light includes SASSS AND ; 5. Find all Kappa-III light chains SELECT name WHERE light <> '' class includes 'kappa' and subgroup(L) = 'III' ; 6. Loop Definitions =================== The SET LOOPS command allows one of the following four sets of loop definitions to be selected: Loop Kabat AbM Chothia Contact ---- ----- --- ------- ------- L1 L24 -- L34 L24 -- L34 L24 -- L34 L30 -- L36 L2 L50 -- L56 L50 -- L56 L50 -- L56 L46 -- L55 L3 L89 -- L97 L89 -- L97 L89 -- L97 L89 -- L96 H1 H31 -- H35B H26 -- H35B H26 -- H32 H30 -- H35B H2 H50 -- H65 H50 -- H58 H52 -- H56 H47 -- H58 H3 H95 -- H102 H95 -- H102 H95 -- H102 H93 -- H101 7. Standard Kabat Numbering =========================== Light chain: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 27A 27B 27C 27D 27E 27F 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 95A 95B 95C 95D 95E 95F 96 97 98 99 100 101 102 103 104 105 106 106A 107 108 109 Heavy chain: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 35A 35B 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 52A 52B 52C 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 82A 82B 82C 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 100A 100B 100C 100D 100E 100F 100G 100H 100I 100J 100K 101 102 103 104 105 106 107 108 109 110 111 112 113 8. Deviations from the Standard Numbering ========================================= The Kabat standard numbering does not specify any naming convention for insertions which occur in addition to the supplied insertion lettering. This is particularly evident with very long CDR-H2 loops. The Kabat database may, for example, specify that additional residues occur between H100G and H100H, but does not specify how these should be named. KabatMan takes a simpler approach and just adds letters sequentially throughout such an insertion region. This may lead to differences in the naming of residues in such very long insertion regions. 9. Acknowledgement ================== I should like to thank the maintainers of the Kabat sequence database for their assistance in making the data available and in listening to our suggestions about the new format. In particular, my thanks go to George Johnson. 10. Revision History ==================== V0.1 12.04.94 Development version V1.0 27.04.94 Original release version V1.1 11.05.94 Some internal changes for distribution purposes and fixed bug in command line parsing. V1.2 11.05.94 Added new function for finding Chothia canonical classes. Blanks in the sequence are now read as ? rather than - V2.0 30.06.94 Reads the new Kabat file format by default. V2.1 11.07.94 Additional check made when matching light & heavy chains; at least one author's name must match. V2.2 21.07.94 New Kabat file reader takes strain information from ANNOTA STRN records. Now only takes the source from the filename if the filename is not `VARIOUS....'. Taking this information from the filename is not really required with the new format files. Also reads and compares all references rather than just the first. V2.3 25.01.95 Added the SET VARIABILITY code. V2.4 08.02.95 Modified for the January 1995 Kabat database format V2.5 07.03.95 The Kabat reading code now places - instead of ? at the end of a sequence. This stops all the sequences which are truncated in CDR3 from giving maximum length CDR3s V2.6 13.03.95 kabatman -version gives version information and exits. Fortran style comparisons (ne, eq, lt, etc.) now available. Fixed small bug in handling PIR option when a file is not specified. ---- 31.03.95 Documentation changes only. V2.7 13.05.95 The Chothia data file may now contain SOURCE records which are ignored by KabatMan and comments introduced by a # as well as a ! --- 16.06.95 Documentation changes only. V2.8 22.06.95 Fixed bug when full chain sequences given in a WHERE clause V2.9 23.06.95 Clean compiling under gcc -Wall V2.10 27.06.95 Fixed bug in fuzzy string matching --- 27.11.95 Documentation changes only. V2.11 15.12.95 Introduced single character wildcard matches --- 26.02.96 Documentation changes only. V2.12 02.04.96 Added handling for IDs and URLs V2.13 11.04.96 Added header lines to data file and datafile date Prints accession codes of skipped entries when reading Kabat data files. Allow --version as well as -version V2.14 18.04.96 Added SET URL V2.15 22.04.96 Fixed potential access violation in reading canonical data file. Fixed initialisation of URL string V2.16 07.05.96 Chothia data file can now use Chothia or Kabat numbering V2.17 29.05.96 Added SET CANONICAL V2.18 10.09.97 Added SUBGROUP(chain) (Chain is either L or H) V2.19 14.10.98 Added SET DELIMITER V2.20 28.05.99 Fixed core-dump bug in EatInitials() V2.21 13.07.00 Fixed long-standing bug preventing string equality comparison when a string contained a ' or " V2.22 31.07.00 Added Contact loop definitions V2.23 03.04.02 Added DATE V2.24 28.02.05 Added LFR1...HFR4 V2.25 24.08.06 Added SEQUENCE