The sequencing of the human genome and intense interest in proteomics and molecular structure have sharply increased the need for biological databases. The first half of this course surveys a wide range of biological databases and their access tools and develops proficiency in their use. These include genome and sequence databases such as GenBank and Ensembl, protein databases such as PDB and Swiss-Prot, and their analysis tools. Tools for accessing and manipulating sequence databases, such as BLAST, multiple alignment, Perl, and gene-finding tools, will be covered. Advanced, specialized, and recently popular databases such as KEGG, BioCyc, HapMap, Allen Brain Atlas, and AfCS will be surveyed for their design and use. The second half of the course focuses on the design of biological databases, including the computational methods used to create the underlying data and the special requirements of biological databases, such as interoperability, complex data structures consisting of very long strings, object orientation, efficient interaction with computational operators, parallel and distributed storage, secure transactions, and fast recall. Students will create their own small database as a course project and complete homework assignments using databases. NOTE: Students without a prior background in databases can succeed in this course through concurrent self-study of relational databases and SQL, using a book such as “Database Solutions: A Step by Step Guide to Building Databases” by Thomas Connolly and Carolyn Begg (Addison-Wesley). Some of this material will be reviewed in class.
Prerequisites: 605.441 Principles of Database Systems, 410.634, or a working knowledge of SQL; and a prior course in molecular biology or cell biology (605.205 or 410.602).