Data for:
Pandey A; Braun EL. The roles of protein structure, taxon sampling, and model complexity 
in phylogenomics: A case study focused on early animal divergences.

This folder contains three subfolders:

################################################################################
01_annotated_NEXUS_files

This folder includes Nexus files, each containing a multiple sequence alignment of one
protein and an ASSUMPTIONS block that defines the following charsets:

CHARSET SS_HELIX = list of sites... ;
CHARSET SS_SHEET = list of sites... ;
CHARSET SS_COIL  = list of sites... ;
CHARSET EXPOSED  = list of sites... ;
CHARSET BURIED   = list of sites... ;
CHARSET INSIDE   = list of sites... ;
CHARSET OUTSIDE  = list of sites... ;
CHARSET MEMBRANE = list of sites... ;
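
The charsets can also be read without PAUP*. The following Python code is a minimal
sketch, not part of the archive. It assumes each site list is written as space-separated
positions and ranges (for example "1-3 7 10-12"), which this README does not spell out.
It does not handle other Nexus notations such as step values. The names parse_site_list
and read_charsets are hypothetical.

```python
import re

# Matches "CHARSET NAME = site list ;" (assumed layout, case-insensitive).
CHARSET_RE = re.compile(r"CHARSET\s+(\w+)\s*=\s*([^;]*);", re.IGNORECASE)


def parse_site_list(spec):
    """Expand a site list such as '1-3 7 10-12' into a list of integers."""
    sites = []
    for token in spec.split():
        if "-" in token:
            start, end = token.split("-")
            sites.extend(range(int(start), int(end) + 1))
        else:
            sites.append(int(token))
    return sites


def read_charsets(nexus_text):
    """Return {charset name: list of 1-based sites}; empty charsets map to []."""
    return {
        name.upper(): parse_site_list(spec)
        for name, spec in CHARSET_RE.findall(nexus_text)
    }


if __name__ == "__main__":
    # Made-up example text, not taken from any file in this archive.
    example = """
    BEGIN ASSUMPTIONS;
    CHARSET SS_HELIX = 1-3 7;
    CHARSET SS_SHEET = 4-6;
    CHARSET MEMBRANE = ;
    END;
    """
    for name, sites in read_charsets(example).items():
        print(name, len(sites), "sites")
```

Charsets that are left empty in a file come back as empty lists, so they can be skipped
with a simple length check.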

Each file also includes a PAUP block that will export the relevant subsets of the data if
the file is executed in PAUP* (available from http://paup.phylosolutions.com). Note that
the PAUP block ends with the command "quit;", so PAUP* exits once the export is complete.

This directory includes both the globular proteins and the transmembrane proteins. Each
file defines only the charsets that are relevant to its type of protein; the remaining
charsets are left empty.

A shell script called "00_remove_paupblock.sh" is provided to remove the PAUP block. To
remove the PAUP block(s) from individual files, run the script with a list of the
filenames to be processed. For example, the command:

./00_remove_paupblock.sh gene1.nex gene2.nex

will output two files that have the PAUP blocks removed:

gene1.nex.nopaup.nxs
gene2.nex.nopaup.nxs

The script can be used to remove all PAUP blocks as follows:

./00_remove_paupblock.sh gene*.nex
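
For readers who cannot run the shell script, the same idea can be sketched in Python.
This is an illustration only and is not the script distributed in this folder. It
assumes the PAUP block starts with a line "begin paup;" and ends with a line "end;" (or
"endblock;"), and that no bracketed comment inside the block contains such a line. The
function names are hypothetical; the output filename mirrors the ".nopaup.nxs"
convention shown above.

```python
import re
import sys

# From a line starting "begin paup;" through the next line starting "end;".
PAUP_BLOCK_RE = re.compile(
    r"^[ \t]*begin\s+paup\s*;.*?^[ \t]*end(?:block)?\s*;[ \t]*\n?",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def strip_paup_blocks(nexus_text):
    """Return the Nexus text with every PAUP block removed."""
    return PAUP_BLOCK_RE.sub("", nexus_text)


def remove_paup_block(filename):
    """Write a copy of filename without PAUP blocks; return the new name."""
    out_name = filename + ".nopaup.nxs"
    with open(filename) as handle:
        cleaned = strip_paup_blocks(handle.read())
    with open(out_name, "w") as handle:
        handle.write(cleaned)
    return out_name


if __name__ == "__main__":
    # Usage: python this_sketch.py gene1.nex gene2.nex
    for name in sys.argv[1:]:
        print(remove_paup_block(name))
```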

################################################################################
02_concatenated_data_files

This folder contains five Nexus files with concatenated data, one for each structurally
defined data type. Each file has a SETS block with the name of each gene. The filenames
and the sizes of the data matrices are:

BURIED.nex      dimensions ntax=97 nchar=194117;
EXPOSED.nex     dimensions ntax=97 nchar=161897;
SS_COIL.nex     dimensions ntax=97 nchar=134334;
SS_HELIX.nex    dimensions ntax=97 nchar=161117;
SS_SHEET.nex    dimensions ntax=97 nchar=60563;
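
The sizes listed above can be cross-checked with a few lines of Python. The sketch below
only adds up the nchar values given in this README. The two totals agree, which would be
consistent with each site being assigned to one solvent accessibility class and one
secondary structure class, although this README does not state that explicitly.

```python
# nchar values copied from the list above.
nchar = {
    "BURIED": 194117,
    "EXPOSED": 161897,
    "SS_COIL": 134334,
    "SS_HELIX": 161117,
    "SS_SHEET": 60563,
}

accessibility_total = nchar["BURIED"] + nchar["EXPOSED"]
secondary_total = nchar["SS_COIL"] + nchar["SS_HELIX"] + nchar["SS_SHEET"]

print(accessibility_total, secondary_total)  # prints: 356014 356014
```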

################################################################################
03_GTR_rate_matrices

GTR rate matrices (21 rows x 20 columns) for each structural environment. The first 20
rows form the symmetric 20 x 20 exchangeability matrix (R), and the final row gives the
amino acid frequencies. The order of amino acids is the standard order for programs such
as PAML (i.e., A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V).

The files are named:

BUR.dat		rate matrix for buried sites
EXP.dat		rate matrix for solvent exposed sites

HELIX.dat	rate matrix for alpha helix sites
SHEET.dat	rate matrix for sheet sites
COIL.dat	rate matrix for coil sites
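
The layout described above (20 rows of exchangeabilities followed by one row of
frequencies) can be read with a short Python sketch such as the one below. It assumes
the values are whitespace-delimited and that every row holds 20 numbers. The function
names are hypothetical and the code is not part of the archive.

```python
# Amino acid order given in this README (the PAML order).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def read_rate_matrix(text):
    """Split the text of a .dat file into (R, frequencies).

    R is a list of 20 rows of 20 floats; frequencies is a list of 20 floats.
    """
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 21 or any(len(row) != 20 for row in rows):
        raise ValueError("expected 21 rows of 20 values each")
    return rows[:20], rows[20]


def exchangeability(R, aa1, aa2):
    """Look up the exchangeability between two one-letter amino acid codes."""
    return R[AMINO_ACIDS.index(aa1)][AMINO_ACIDS.index(aa2)]


def is_symmetric(R, tol=1e-9):
    """Check that R[i][j] equals R[j][i] within a tolerance."""
    return all(abs(R[i][j] - R[j][i]) <= tol
               for i in range(20) for j in range(i))


if __name__ == "__main__":
    with open("BUR.dat") as handle:
        R, freqs = read_rate_matrix(handle.read())
    print("symmetric:", is_symmetric(R))
    print("frequency sum:", sum(freqs))
    print("A <-> R exchangeability:", exchangeability(R, "A", "R"))
```

Checking the symmetry of R and the sum of the frequencies is a quick way to confirm that
a file was read in the expected orientation.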

################################################################################
