ANALYSIS OF DNA AND PROTEIN
SEQUENCES USING NCBI RESOURCES
http://webspace.ship.edu/wjpatr/cell_web_lab2.html
DNA fragments
obtained by cloning specific genes, or high-throughput next-generation
sequencing data from gene expression analysis or metagenomic studies, need to
be identified by analysis of the obtained sequences. If a sequence is a coding
sequence, i.e., if it encodes a polypeptide, you will also wish to determine
structure/function characteristics of the encoded protein. Fortunately, there are
freely accessible online databases for such analyses.
Many of you have used the NCBI (National Center for Biotechnology
Information) “GenBank” database in previous courses. Over the years, NCBI has become a vast and
highly integrated set of tools and databases for genomes, transcriptomes,
protein sequences and structures, with links to the literature and external tools
and resources. In this exercise, you
will use the DNA sequence below as the basis for a number of
exercises related to database searches and analyses on the web, as well as to
help reinforce concepts related to gene expression and protein structure. To
start, use the mouse to select the text of the sequence
below. Once it is highlighted, press Ctrl + C to copy the text to the
clipboard. You will then be able to paste the sequence into various
programs on the web. Be sure to keep this page open, opening
links in new tabs or windows.
NOTE 1: By convention, DNA sequences are published as
single strands written left to right, from the 5' end to the 3' end. However, genomes of cells are
double-stranded, and both strands get used as templates for
transcription. Accordingly, both strands must be analyzed when searching
for coding regions.
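The second strand never has to be typed in separately, because it can be derived from the published one. A minimal Python sketch of that derivation (the function name is only illustrative):

```python
# Map each base to its Watson-Crick partner (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]

top = "CCCCGGCGCAGC"  # first 12 bases of the sequence below
print(reverse_complement(top))  # GCTGCGCCGGGG
```

Applying the function twice returns the original strand, which is a quick sanity check.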
CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG
CCGGAGTCCCGAGCTAGCCCCGGCGGCCGCCGCCGCCCAGACCGGACGACAGGCCACCTCGTCGGCGTCC
GCCCGAGTCCCCGCCTCGCCGCCAACGCCACAACCACCGCGCACGGCCCCCTGACTCCGTCCAGTATTGA
TCGGGAGAGCCGGAGCGAGCTCTTCGGGGAGCAGCGATGCGACCCTCCGGGACGGCCGGGGCAGCGCTCC
TGGCGCTGCTGGCTGCGCTCTGCCCGGCGAGTCGGGCTCTGGAGGAAAAGAAAGTTTGCCAAGGCACGAG
TAACAAGCTCACGCAGTTGGGCACTTTTGAAGATCATTTTCTCAGCCTCCAGAGGATGTTCAATAACTGT
GAGGTGGTCCTTGGGAATTTGGAAATTACCTATGTGCAGAGGAATTATGATCTTTCCTTCTTAAAGACCA
TCCAGGAGGTGGCTGGTTATGTCCTCATTGCCCTCAACACAGTGGAGCGAATTCCTTTGGAAAACCTGCA
GATCATCAGAGGAAATATGTACTACGAAAATTCCTATGCCTTAGCAGTCTTATCTAACTATGATGCAAAT
AAAACCGGACTGAAGGAGCTGCCCATGAGAAATTTACAGGAAATCCTGCATGGCGCCGTGCGGTTCAGCA
ACAACCCTGCCCTGTGCAACGTGGAGAGCATCCAGTGGCGGGACATAGTCAGCAGTGACTTTCTCAGCAA
CATGTCGATGGACTTCCAGAACCACCTGGGCAGCTGCCAAAAGTGTGATCCAAGCTGTCCCAATGGGAGC
TGCTGGGGTGCAGGAGAGGAGAACTGCCAGAAACTGACCAAAATCATCTGTGCCCAGCAGTGCTCCGGGC
GCTGCCGTGGCAAGTCCCCCAGTGACTGCTGCCACAACCAGTGTGCTGCAGGCTGCACAGGCCCCCGGGA
GAGCGACTGCCTGGTCTGCCGCAAATTCCGAGACGAAGCCACGTGCAAGGACACCTGCCCCCCACTCATG
CTCTACAACCCCACCACGTACCAGATGGATGTGAACCCCGAGGGCAAATACAGCTTTGGTGCCACCTGCG
TGAAGAAGTGTCCCCGTAATTATGTGGTGACAGATCACGGCTCGTGCGTCCGAGCCTGTGGGGCCGACAG
CTATGAGATGGAGGAAGACGGCGTCCGCAAGTGTAAGAAGTGCGAAGGGCCTTGCCGCAAAGTGTGTAAC
GGAATAGGTATTGGTGAATTTAAAGACTCACTCTCCATAAATGCTACGAATATTAAACACTTCAAAAACT
GCACCTCCATCAGTGGCGATCTCCACATCCTGCCGGTGGCATTTAGGGGTGACTCCTTCACACATACTCC
TCCTCTGGATCCACAGGAACTGGATATTCTGAAAACCGTAAAGGAAATCACAGGGTTTTTGCTGATTCAG
GCTTGGCCTGAAAACAGGACGGACCTCCATGCCTTTGAGAACCTAGAAATCATACGCGGCAGGACCAAGC
AACATGGTCAGTTTTCTCTTGCAGTCGTCAGCCTGAACATAACATCCTTGGGATTACGCTCCCTCAAGGA
GATAAGTGATGGAGATGTGATAATTTCAGGAAACAAAAATTTGTGCTATGCAAATACAATAAACTGGAAA
AAACTGTTTGGGACCTCCGGTCAGAAAACCAAAATTATAAGCAACAGAGGTGAAAACAGCTGCAAGGCCA
CAGGCCAGGTCTGCCATGCCTTGTGCTCCCCCGAGGGCTGCTGGGGCCCGGAGCCCAGGGACTGCGTCTC
TTGCCGGAATGTCAGCCGAGGCAGGGAATGCGTGGACAAGTGCAACCTTCTGGAGGGTGAGCCAAGGGAG
TTTGTGGAGAACTCTGAGTGCATACAGTGCCACCCAGAGTGCCTGCCTCAGGCCATGAACATCACCTGCA
CAGGACGGGGACCAGACAACTGTATCCAGTGTGCCCACTACATTGACGGCCCCCACTGCGTCAAGACCTG
CCCGGCAGGAGTCATGGGAGAAAACAACACCCTGGTCTGGAAGTACGCAGACGCCGGCCATGTGTGCCAC
CTGTGCCATCCAAACTGCACCTACGGATGCACTGGGCCAGGTCTTGAAGGCTGTCCAACGAATGGGCCTA
AGATCCCGTCCATCGCCACTGGGATGGTGGGGGCCCTCCTCTTGCTGCTGGTGGTGGCCCTGGGGATCGG
CCTCTTCATGCGAAGGCGCCACATCGTTCGGAAGCGCACGCTGCGGAGGCTGCTGCAGGAGAGGGAGCTT
GTGGAGCCTCTTACACCCAGTGGAGAAGCTCCCAACCAAGCTCTCTTGAGGATCTTGAAGGAAACTGAAT
TCAAAAAGATCAAAGTGCTGGGCTCCGGTGCGTTCGGCACGGTGTATAAGGGACTCTGGATCCCAGAAGG
TGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAA
ATCCTCGATGAAGCCTACGTGATGGCCAGCGTGGACAACCCCCACGTGTGCCGCCTGCTGGGCATCTGCC
TCACCTCCACCGTGCAGCTCATCACGCAGCTCATGCCCTTCGGCTGCCTCCTGGACTATGTCCGGGAACA
CAAAGACAATATTGGCTCCCAGTACCTGCTCAACTGGTGTGTGCAGATCGCAAAGGGCATGAACTACTTG
GAGGACCGTCGCTTGGTGCACCGCGACCTGGCAGCCAGGAACGTACTGGTGAAAACACCGCAGCATGTCA
AGATCACAGATTTTGGGCTGGCCAAACTGCTGGGTGCGGAAGAGAAAGAATACCATGCAGAAGGAGGCAA
AGTGCCTATCAAGTGGATGGCATTGGAATCAATTTTACACAGAATCTATACCCACCAGAGTGATGTCTGG
AGCTACGGGGTGACCGTTTGGGAGTTGATGACCTTTGGATCCAAGCCATATGACGGAATCCCTGCCAGCG
AGATCTCCTCCATCCTGGAGAAAGGAGAACGCCTCCCTCAGCCACCCATATGTACCATCGATGTCTACAT
GATCATGGTCAAGTGCTGGATGATAGACGCAGATAGTCGCCCAAAGTTCCGTGAGTTGATCATCGAATTC
TCCAAAATGGCCCGAGACCCCCAGCGCTACCTTGTCATTCAGGGGGATGAAAGAATGCATTTGCCAAGTC
CTACAGACTCCAACTTCTACCGTGCCCTGATGGATGAAGAAGACATGGACGACGTGGTGGATGCCGACGA
GTACCTCATCCCACAGCAGGGCTTCTTCAGCAGCCCCTCCACGTCACGGACTCCCCTCCTGAGCTCTCTG
AGTGCAACCAGCAACAATTCCACCGTGGCTTGCATTGATAGAAATGGGCTGCAAAGCTGTCCCATCAAGG
AAGACAGCTTCTTGCAGCGATACAGCTCAGACCCCACAGGCGCCTTGACTGAGGACAGCATAGACGACAC
CTTCCTCCCAGTGCCTGAATACATAAACCAGTCCGTTCCCAAAAGGCCCGCTGGCTCTGTGCAGAATCCT
GTCTATCACAATCAGCCTCTGAACCCCGCGCCCAGCAGAGACCCACACTACCAGGACCCCCACAGCACTG
CAGTGGGCAACCCCGAGTATCTCAACACTGTCCAGCCCACCTGTGTCAACAGCACATTCGACAGCCCTGC
CCACTGGGCCCAGAAAGGCAGCCACCAAATTAGCCTGGACAACCCTGACTACCAGCAGGACTTCTTTCCC
AAGGAAGCCAAGCCAAATGGCATCTTTAAGGGCTCCACAGCTGAAAATGCAGAATACCTAAGGGTCGCGC
CACAAAGCAGTGAATTTATTGGAGCATGACCACGGAGGATAGTATGAGCCCTAAAAATCCAGACTCTTTC
GATACCCAGGACCAAGCCACAGCAGGTCCTCCATCCCAACAGCCATGCCCGCATTAGCTCTTAGACCCAC
AGACTGGTTTTGCAACGTTTACACCGACTAGCCAGGAAGTACTTCCACCTCGGGCACATTTTGGGAAGTT
GCATTCCTTTGTCTTCAAACTGTGAAGCATTTACAGAAACGCATCCAGCAAGAATATTGTCCCTTTGAGC
AGAAATTTATCTTTCAAAGAGGTATATTTGAAAAAAAAAAAAAGTATATGTGAGGATTTTTATTGATTGG
GGATCTTGGAGTTTTTCATTGTCGCTATTGATTTTTACTTCAATGGGCTCTTCCAACAAGGAAGAAGCTT
GCTGGTAGCACTTGCTACCCTGAGTTCATCCAGGCCCAACTGTGAGCAAGGAGCACAAGCCACAAGTCTT
CCAGAGGATGCTTGATTCCAGTGGTTCTGCTTCAAGGCTTCCACTGCAAAACACTAAAGATCCAAGAAGG
CCTTCATGGCCCCAGCAGGCCGGATCGGTACTGTATCAAGTCATGGCAGGTACAGTAGGATAAGCCACTC
TGTCCCTTCCTGGGCAAAGAAGAAACGGAGGGGATGGAATTCTTCCTTAGACTTACTTTTGTAAAAATGT
CCCCACGGTACTTACTCCCCACTGATGGACCAGTGGTTTCCAGTCATGAGCGTTAGACTGACTTGTTTGT
CTTCCATTCCATTGTTTTGAAACTCAGTATGCTGCCCCTGTCTTGCTGTCATGAAATCAGCAAGAGAGGA
TGACACATCAAATAATAACTCGGATTCCAGCCCACATTGGATTCATCAGCATTTGGACCAATAGCCCACA
GCTGAGAATGTGGAATACCTAAGGATAGCACCGCTTTTGTTCTCGCAAAAACGTATCTCCTAATTTGAGG
CTCAGATGAAATGCATCAGGTCCTTTGGGGCATAGATCAGAAGACTACAAAAATGAAGCTGCTCTGAAAT
CTCCTTTAGCCATCACCCCAACCCCCCAAAATTAGTTTGTGTTACTTATGGAAGATAGTTTTCTCCTTTT
ACTTCACTTCAAAAGCTTTTTACTCAAAGAGTATATGTTCCCTCCAGGTCAGCTGCCCCCAAACCCCCTC
CTTACGCTTTGTCACACAAAAAGTGTCTCTGCCTTGAGTCATCTATTCAAGCACTTACAGCTCTGGCCAC
AACAGGGCATTTTACAGGTGCGAATGACAGTAGCATTATGAGTAGTGTGGAATTCAGGTAGTAAATATGA
AACTAGGGTTTGAAATTGATAATGCTTTCACAACATTTGCAGATGTTTTAGAAGGAAAAAAGTTCCTTCC
TAAAATAATTTCTCTACAATTGGAAGATTGGAAGATTCAGCTAGTTAGGAGCCCACCTTTTTTCCTAATC
TGTGTGTGCCCTGTAACCTGACTGGTTAACAGCAGTCCTTTGTAAACAGTGTTTTAAACTCTCCTAGTCA
ATATCCACCCCATCCAATTTATCAAGGAAGAAATGGTTCAGAAAATATTTTCAGCCTACAGTTATGTTCA
GTCACACACACATACAAAATGTTCCTTTTGCTTTTAAAGTAATTTTTGACTCCCAGATCAGTCAGAGCCC
CTACAGCATTGTTAAGAAAGTATTTGATTTTTGTCTCAATGAAAATAAAACTATATTCATTTCCACTCTA
AAAAAAAAAAAAAAAA
Once you have selected and copied the text, you will use some tools maintained at
the NCBI. The NCBI site has links to many tools, including search engines for
sequences and structures, and PubMed, which allows access to millions of
citations, abstracts, and links to the journal articles.
I. Open Reading Frames
When a DNA sequence is obtained, we frequently want to know if the sequence codes for
a protein. To do this, we look for start codons (ATG) on each strand of
the DNA, and then read both strands of the sequence 5' to 3' to see if triplet
codons continue in frame without reaching a stop codon (TAA, TAG, or TGA). If they
continue, they constitute an open reading frame (ORF), which could potentially be a
protein-coding sequence.
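The search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm ORF Finder actually uses. The function names, the `min_codons` cutoff, and the frame labels are assumptions made for this example. It reports only ORFs that begin with ATG and end at a stop codon.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    start and end are 1-based positions on the sequence as written, so
    for the - strand start is larger than end. n_codons excludes the stop.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG of the current ORF
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        end = i + 3  # one past the last base of the stop
                        if strand == "+":
                            yield (strand, frame + 1, start + 1, end, n_codons)
                        else:
                            # Convert back to coordinates on the strand as written.
                            yield (strand, frame + 1, n - start, n - end + 1, n_codons)
                    start = None

# Toy example: one short ORF (ATG AAA TAG) on the + strand.
print(list(find_orfs("CCATGAAATAGCC", min_codons=1)))
```

To try it on the exercise sequence, paste the sequence into a string with the line breaks removed and pass it to `find_orfs`.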
Connect
to the Open Reading
Frame (ORF) Finder program link. (From the NCBI homepage,
click on the “Analyze” heading in the
middle of the page, and click on the "All Tools" list. ORF
Finder will be just past midway down the page, listed alphabetically.)
To run the program, paste (Ctrl + V) the sequence into the large data input window labeled "sequence in FASTA format". Click on the "Submit" button below the window. Within a minute or two, an ORF viewer window will appear at the top, showing the ORFs as red lines with little arrows indicating the direction of each ORF (a right arrow for the + strand, a left arrow for the - strand).
Below, on the right, the ORFs are listed in order of their length in nucleotides and
amino acids. Usually, the longest ORF is the true one. The list
also indicates whether the sense strand for the ORF is the + strand (left to
right as written) or the - strand (the complement, reading right to left), and
shows the start and stop positions on the nucleotide strand. Below left is a window showing the sequence
of the selected ORF. The “Display ORF as” menu above the window allows
selection of protein only, nucleotide only, or both (CDS translation).
Click
on the longest blue ORF (reading frame +1) and examine the sequence that opens
on the left in CDS translation view.
NOTE 2: Below the data input window of OrfFinder is a menu to choose the genetic code. The genetic code for nuclear-encoded genes is
“standard” for plants and animals, but it varies somewhat for mitochondrial
genes and the nuclear-encoded genes of some protists. Alternative start codons, such as GUG in
bacteria, can also be taken into account. These codon variants are not extreme, and a
common ancestry of all extant genetic codes is still a reasonable hypothesis.
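To give a feel for how small these differences are, the sketch below lists the codons that differ between the standard code (NCBI translation table 1) and the vertebrate mitochondrial code (translation table 2). Only the differing codons are included, and the dictionary and function names are made up for this example.

```python
# Codons whose meaning differs between the standard genetic code and the
# vertebrate mitochondrial code. "*" marks a stop codon.
STANDARD = {"TGA": "*", "AGA": "R", "AGG": "R", "ATA": "I"}
VERTEBRATE_MITO = {"TGA": "W", "AGA": "*", "AGG": "*", "ATA": "M"}

def compare_codes(codon):
    """Return the (standard, vertebrate mitochondrial) meaning of a codon."""
    return STANDARD[codon], VERTEBRATE_MITO[codon]

for codon in STANDARD:
    std, mito = compare_codes(codon)
    print(f"{codon}: standard={std}  vertebrate mitochondrial={mito}")
```

All other codons have the same meaning in both tables.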
Q1. How long is the protein (number
of amino acids)? At which base pair (number in the DNA sequence)
does the ORF start? At which base pair (number in the DNA
sequence) does the ORF end? What stop codon is used? What
are the first and last amino acids in the sequence? Note some of this
information is summarized in the table to the right of the sequence.
Now click on the third largest ORF (ORF32) in the -2
reading frame and examine the sequence.
Q2. How long is the protein? At which base pair
(number in the DNA sequence) does the ORF start? At which base pair
(number in the DNA sequence) does the ORF end? What stop codon is
used? What are the first and last amino acids in the sequence? Why
does the DNA sequence count down rather than up as in the case
above? (The sequence viewer at the top of the
ORF Finder page may help.)
Q3. Explain why OrfFinder needs
to analyze six different reading frames (+1,+2,+3 and
-1, -2, -3). Refer to your answers in
the previous question to help answer this one.
(Additional hints: DNA is double-stranded, and the genetic code is read in
triplets.)
II. Search
for and view similar protein sequences using Protein BLAST*
OrfFinder results can be directly linked to BLAST
searches. BLAST (Basic Local
Alignment Search Tool) is a program that searches sequences at GenBank for
matches to an input sequence. There are a number of types of BLAST searches, including searches that
look for matching nucleotide sequences, protein sequences, and protein sequences
to DNA sequences "translated" by the BLAST program, as well as
specialized programs to design PCR primers for a nucleic acid sequence of
interest.
Since we are working with a translated DNA sequence (i.e., a polypeptide sequence)
from the OrfFinder results, we will use the default
protein BLAST (blastp) from the site, which compares your input protein sequence
(the translated ORF) against a database of "translated" coding gene sequences. The BLAST algorithm looks for matches between
your “query” sequence and subject sequences and creates
alignments between them.
*NOTE 3: Sequence databases are very large and are growing rapidly. The BLAST button from OrfFinder uses the smallest database, SwissProt, as a default. The other choices include RefSeq protein, a GenBank database filtered to contain only verified results, and “nr”, an abbreviation for non-redundant, which is a misnomer since it combines multiple databases. Your initial BLAST against the RefSeq database, using otherwise default settings, will return a large range of closely related sequences and will serve to identify your protein. We will later make use of limits that will allow us to explore some specific relationships that are not otherwise obvious.
A. Initial BLAST – identify the sequence
To run BLAST on your protein sequence, again select ORF1. Choose
the protein reference sequence (RefSeq) database (not the
default SwissProt) and click on the BLAST button
below the sequence window (NOT the Smart BLAST). This will open a new
BLAST entry page. Keep the default parameters for blastp, but double-check
that the database on the entry page is still RefSeq. Select "show results in a new
window" and click on the BLAST button near the bottom of the page.
The program may take a few minutes to run, with the results opening in a new tab.
Don’t close the BLAST entry page; you will rerun the search with different parameters
to facilitate finding information regarding relationships to homologous
sequences in other organisms and in humans.
*The format of the BLAST output has recently changed, mostly for the better. The results are split up into multiple tabs,
highlighted in the screenshot below. The page will initially open to the
Descriptions tab, which gives you a list of matches or “hits” (proteins in the
database) in order of total score, with the best match at the
top. The list gives the name of the protein, and often the organism. Also listed
are the score, the query cover (the % of amino acids in the query that aligned),
and the E value. The higher the score, the better the match; the smaller the E value, the less likely
the match is just by chance. The E values are particularly dependent on
the length of your sequence and the database size.
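The dependence on sequence length and database size can be made concrete with the usual approximation E = m x n x 2^(-S'), where m is the query length, n is the total length of the database, and S' is the bit score. The sketch below is a simplification: BLAST itself uses "effective" lengths that are somewhat shorter than the raw ones. The numbers are invented for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value: expected number of chance hits scoring at
    least bit_score, using raw rather than effective lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Same bit score, different search spaces (all numbers are made up).
for query_len, db_len in [(100, 1e8), (1000, 1e8), (1000, 1e10)]:
    e = expect_value(60, query_len, db_len)
    print(f"query={query_len} aa, database={db_len:.0e} aa, E={e:.2e}")
```

Keep in mind that the bit score itself also depends on how long and how good the alignment is, so the three quantities are not independent in a real search.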
Q4. Look at the list of matches. What is the name of
the protein? In what organism is the
best match found? (Use the top several
listings, which have the best matches.)
Note that the best match has 100% identity. What are the next several organisms? Taxonomy
details will be examined later. You may recognize these well-known primates; if not, click on the
scientific name.
Q5. Describe what the E value signifies.* If your query sequence was quite
short, how would the E value be affected in a BLAST search? Why?
*NOTE 4: You can get
more information about E values and the BLAST algorithms by opening the links
at the top of the page in a new window for videos and for how to read this
report. There are 2 short and good
videos on E values (the first one provides sufficient background). As of 10/11/22 above link does work. Below is an
alternative link to part one: https://www.youtube.com/watch?v=ZN3RrXAe0uM
Do not close the tab/window with these BLAST results - you will be
using the links from it for the next several questions.
B. Examine alignments
To view pairwise alignments
of the sequences and to identify differences, click the check box at the top left of
the list to select all of the sequences (there are
times when you will limit the alignments to simplify and speed up your analysis).
Click on the Alignments tab. Just under the tabs, change the alignment view to “pairwise
with dots for identities”, then scroll down to examine some of the
sequences. The query is the input
sequence (top line) and the subject is the match found
in the database. Dots indicate amino acids identical to the query. Differences are indicated by red letters.
Q6. Examine the alignment
with the first match. Are all the amino
acids identical between the query and the subject (identified as NP_005219.2)?
Q7. Examine the second alignment, which is for Pan troglodytes (chimpanzee) (XP_519102.3). How many amino acids differ between the two sequences?
Would you expect substituting an Ile residue with Leu, or a Ser with a Thr, to have a major effect on the overall protein
structure and function? Why or why
not? Note that at the top of the alignment,
the number of “identities” and the number of “positives” are shown. Which of the substitutions in the alignment
is not a “positive”? Why? Is it surprising that the chimpanzee sequence is the second-best match?
If you scroll down the
description list, what other organisms do you recognize? Unless you just took Vert Zoo, probably not
many!
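For those curious how BLAST arrives at the “identities” and “positives” counts: a column is an identity if the two residues are the same, and a positive if the pair has a score above zero in the substitution matrix (BLOSUM62 by default for blastp). The sketch below uses a handful of BLOSUM62 scores copied by hand and two invented six-residue peptides. It is an illustration only, not BLAST's own code.

```python
# A few BLOSUM62 scores, copied by hand for illustration only.
BLOSUM62_EXCERPT = {
    ("I", "L"): 2, ("S", "T"): 1, ("D", "E"): 2, ("K", "R"): 2,
    ("D", "L"): -4, ("G", "W"): -2,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]

def identities_and_positives(query, subject):
    """Count identical columns and columns with a positive score.

    Identical residues always score above zero in BLOSUM62, so every
    identity is also counted as a positive.
    """
    identities = positives = 0
    for a, b in zip(query, subject):
        if a == b:
            identities += 1
            positives += 1
        elif pair_score(a, b) > 0:
            positives += 1
    return identities, positives

# Invented peptides: 1 identity (M/M) and 4 positives (M/M, I/L, S/T, K/R).
print(identities_and_positives("MISDGK", "MLTLWR"))  # (1, 4)
```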
C. Examine taxonomic relationships
1. Lists based on homology
To examine taxonomic/evolutionary relationships of your "hits" more conveniently,
results can be displayed as taxonomy reports and as tree diagrams. Click the
Taxonomy tab and examine the default Lineage view.
The Taxonomy report produces a table-like listing with the broadest taxonomic categories
toward the left, and narrower groupings, like primates and rodents, further to
the right, all under the broadest category umbrella of Eutheria
– placental mammals. Organisms listed furthest to the left are usually less
taxonomically related to our human sequence. If you click on Organism,
rather than Lineage, the hits are grouped by organism and include the common
name, which is handy if you haven't run into a Loxodonta africana
recently (you would probably get hurt if you did).
Q8. What is the name given for the protein for all the matches? What taxonomic group, as
a whole, has the sequences most similar to human? What is one of the least
human-like organisms (taxonomically) listed with an orthologous
sequence? Would you say that this protein is evolutionarily
conserved? Why or why not?
Q9. How many human matches do you find (list by
organism)? How are these different
sequences distinguished in their names?
Note that this type of terminology (isoform) is typical for normal protein
variants encoded by one gene, NOT different genes. How do you think these isoforms arise?
2. Tree Diagrams*
Distance
Tree of Results creates a cladogram-like tree diagram grouping "hits"
by their sequence similarities, into various branches. The distances between
the branches (length of the lines), and between the leaves on the branches, are
proportional to their differences in sequence. The branches themselves
are largely organized by taxonomy. The link for the distance tree is just above
the Graphic Summary tab. Clicking on the link (make sure all sequences are
still selected) will open the tree in a new tab.
Click on the link for Distance Tree of
Results. From the tools menu, try
the 4 different views: rectangle view, slant view, radial and force. Make
use of the color coding shown to the right to help you find different taxonomic
groups. I personally find "slant view" or "rectangular
view" to be more easily interpreted. Only the sequences with a
relatively high match will be presented. Controls on the right will allow
you to alter the view. The default is a collapsed view, which is
sufficient for this part of our exercise. To see detailed labels for
everything, choose “expand all” from the tools, and click on TXT to optimize
the view. You can likewise change the sequence label setting between sequence
title and taxonomic name to more easily identify sequences
and organisms. If the tree is too big to get the “big picture”,
download the tree as a PDF file, an option from the tools
menu. Viewing it in Adobe Reader or a separate browser tab may be easier for
you.
Q10. The query sequence
is highlighted in yellow in the tree. What else is grouped with it on the
branch?
NOTE 5: The trees created using the pairwise alignments
of BLAST are not used for publication-grade phylogenetic trees. The BLAST-based
trees are good for a first approximation.
Publication-grade trees are derived from an alignment of all the
sequences together (a multiple alignment) instead of just pairwise
alignments. A free and reasonably
easy-to-use program called MEGA (currently MEGA 11) is available for Windows
and macOS for anyone interested in creating sequence-based phylogenetic trees (https://www.megasoftware.net/).
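As a rough illustration of what “distance” means in such trees, the sketch below computes the simplest possible measure, the p-distance (the fraction of aligned positions that differ), for every pair in a small multiple alignment. The three aligned sequences are invented, and real tools such as BLAST's tree view or MEGA use more elaborate distance models, so treat this only as a starting point.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned, ungapped columns at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

# Invented aligned sequences; "-" marks a gap in the alignment.
aligned = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",
    "seqC": "MRSAY-AKHR",
}

for name1, name2 in combinations(aligned, 2):
    d = p_distance(aligned[name1], aligned[name2])
    print(f"{name1} vs {name2}: {d:.3f}")
```

A tree-building method then groups the sequences so that pairs with small distances sit on nearby branches.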
D. Conserved functional units of 3D structure - Conserved Domains
Select Graphic Summary from the BLAST results page. On top, a map showing detected conserved domains in the polypeptide will be shown. Domains are functional substructures within an overall polypeptide structure. Conserved domain search algorithms are automatically run with protein BLASTs and are actually faster than the alignment determination. Find information regarding the functions of these domains by opening the image as a link.
If you want to see the 3D structures, you can download
the stand-alone program Cn3D for PC or Mac and use the menu to select
Style>Rendering Shortcuts>
Q11. Use the
descriptions and links on the page to describe the characteristics and roles of
the following regions. In addition,
write down the range of amino acid residues corresponding to these regions in
your query sequence. Finally, indicate whether you think these domains
are likely to be intracellular or extracellular, based on their
functions. For the Receptor L domain describe the secondary structure.
PTKc-EGFR domain
Transmembrane domain
Furin-like repeats
Receptor L domain
Growth factor receptor domain IV
NOTE 6: Domain descriptions and assignments do change
over time. The usual numbering of the
EGFR domains from N terminus to C terminus is I – first L domain, II –
first furin-like repeat, III – second L domain, and
IV – the second furin-like repeat (sometimes
counted as 2 furin-like repeat domains). More recently described is the growth
factor receptor IV domain, which links furin-like repeats
to L domains.
III. Reference Sequence (RefSeq) - an integrated and
reliable sequence database
NCBI hosts a large number of sequence databases. The RefSeq databases for proteins and nucleic
acid sequences are among the most reliable.
They are reviewed and annotated sequence entries that are summarized
compilations of multiple entries. In contrast, the nr (“non-redundant”)
database used in the default BLAST is in fact highly redundant and contains
many partial sequences, essentially identical sequences, and sometimes
misidentified sequences. Other
databases, such as environmental or “metagenomic” samples are even more
redundant and unfiltered. We will use
the RefSeq entry for your protein sequence as a gateway to more information
regarding the protein. Most RefSeq
accession numbers begin with a prefix that indicates the molecule type: NP_ or XP_ (proteins),
NM_ or XM_ (mRNA/cDNA), NR_ or XR_ (non-coding RNA), and NC_ or NG_ (genomic sequences).
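The prefix convention is regular enough to check mechanically. The sketch below covers only the most common RefSeq prefixes, with descriptions paraphrased from NCBI's accession prefix conventions. The function name is made up for this example.

```python
# Common RefSeq accession prefixes. N* entries are curated or based on
# known records; X* entries are computational model predictions.
REFSEQ_PREFIXES = {
    "NP_": "protein",
    "XP_": "protein (predicted model)",
    "NM_": "mRNA",
    "XM_": "mRNA (predicted model)",
    "NR_": "non-coding RNA",
    "XR_": "non-coding RNA (predicted model)",
    "NC_": "complete genomic molecule (e.g., a chromosome)",
    "NG_": "genomic region",
}

def molecule_type(accession):
    """Describe a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "not a prefix listed here")

for acc in ["NP_005219.2", "XP_519102.3", "NM_005228.5"]:
    print(acc, "->", molecule_type(acc))
```

The number after the dot is the version, which increases each time the sequence of an entry is updated.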
Back
to your BLAST results - Open the link for the entry (at or near the
top), ref|NP_005219.2|, in a new window to examine the GenBank entry. (If you have lost your BLAST results page you
can paste the accession number NP_005219.2 into an NCBI search and open
the protein link). Once you have opened
this page, you can close the previous BLAST pages. You will want to keep this page open, since
it will be used to access additional links.
As you scroll down the page you will see links to related PubMed entries,
mostly primary source journal articles.
To the side you will see many links: tools including BLAST; related
sequences; 3D structures; resources such as OMIM, Gene, and Conserved Domain
Database (CDD).
Q12. Read the
summary just below the list of PubMed entries.
What specific location in a mammalian cell will this protein be
found? What does this receptor protein
bind? What does activation of this
receptor protein cause in cells? With what disease have mutations been
associated?
As you continue to scroll down, you will find a detailed list of
structural features of the polypeptide sequence (located near the bottom of the
page). The list includes regions such as
the signal sequence, domains, disulfide bonds, glycosylation sites, and
phosphorylation sites. These
descriptions run, roughly, from N to C terminus, from top to bottom. Clicking a feature link will highlight the
corresponding sequence and open a pop-up with some details.
“sig-peptide” – role in translating proteins of the endomembrane system?
“mat-peptide” – what happened to the signal peptide?
“Region” 634..674 – how would you describe the properties of amino acids 650-668 in this region? Why is this to be expected?
“Region” 704..1016 – would you expect this region to be intracellular or extracellular?
“Site” 998 – how does this relate to the region described above?
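One way to put a number on the “properties” of a stretch of amino acids is to average a hydropathy scale over it. The sketch below uses the Kyte-Doolittle scale, in which positive values are hydrophobic and negative values are hydrophilic. The two test peptides are invented for illustration, and the function name is an assumption of this example.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(peptide):
    """Average Kyte-Doolittle value over a peptide (one-letter codes)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide.upper()) / len(peptide)

# Invented peptides: one rich in nonpolar residues, one rich in charged ones.
print(round(mean_hydropathy("AILVLLAVFG"), 2))
print(round(mean_hydropathy("DKRESNQEKD"), 2))
```

If the full protein sequence is stored in a string called `protein`, residues 650-668 correspond to the Python slice `protein[649:668]`.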
IV. Back to nucleic acids - How most protein
sequences are determined
Go back to the top of the protein reference
sequence entry and click the link for DBSOURCE REFSEQ: accession NM_005228.5
near the top of the page.
This is the actual source of the amino acid sequence, translated using a
program similar to ORF finder. It is also a
Reference Sequence (RefSeq), which means it is a reviewed and highly annotated
entry.
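Entries like these can also be retrieved without the browser, through NCBI's E-utilities. The sketch below builds an EFetch request for a FASTA record. The helper names are made up for this example, and the download function needs an internet connection, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db):
    """Build an EFetch URL that returns the record in FASTA format.

    db is "nuccore" for nucleotide entries or "protein" for proteins.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def fetch_fasta(accession, db):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(efetch_url(accession, db)) as response:
        return response.read().decode()

print(efetch_url("NM_005228.5", "nuccore"))
print(efetch_url("NP_005219.2", "protein"))
# To download: fasta_text = fetch_fasta("NM_005228.5", "nuccore")
```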
In the top portion of the entry
you will find a summary of information.
Q14. What is the name (definition) of the entry?
Q15. How many base pairs long is the reported nucleotide sequence?
Q16. Why does it say mRNA
at the top of the entry? Does the sequence at the bottom of the page read
like RNA or DNA? Hint: Note the 3' end of the sequence at the top of
this page that you used for ORF Finder.
That sequence corresponds to NM_005228.3,
an earlier
version of the RefSeq sequence.
Q17. For most eukaryotic gene
sequences, in contrast to mRNA sequences such as the one for EGFR that
we analyzed, OrfFinder is of limited use: a long,
continuous open reading frame such as the one we obtained with the mRNA
sequence is extremely rare using actual genomic DNA sequences. WHY?
To obtain sequences of mRNA, mRNA is
typically used as a template to synthesize cDNA (complementary DNA) using a reverse transcriptase. The cDNA can be cloned and sequenced, or
sequenced directly using newer high-throughput techniques (referred to as RNA-Seq).
NOTE 7: RefSeq database entries are crucial for
reasonably assured sequence comparisons using BLAST and other tools, and have numerous links to
related information. An even more
comprehensive derivative database for the more common organisms is the Gene database. Although we are not making use of it in this
exercise (there are links from both the protein and nucleic acid RefSeq pages),
it is very useful as a “one-stop-shop” for information on specific genes and
their corresponding transcripts and proteins.
V. Modifying BLAST searches for more focused queries
A. Are EGFR homologs present in all types of cells,
or just “higher” multicellular organisms?
Our initial BLAST
search provided a direct answer to the question: What is the identity of our
protein sequence? We also found that the
protein was found in a wide variety of placental mammals. Examination of some of the roles reported for
the gene product suggests significance in the regulation of cell division, leading us to
pose the following questions: Are EGFR
homologs present in all types of cells, or just “higher” multicellular
organisms? How about plants, fungi,
unicellular protists, and bacteria?
Q18. What are your hypotheses regarding these organisms? What is your rationale for your hypotheses – in other words, why do you think they are reasonable?
To test these hypotheses, BLAST the protein sequence, limiting the search to the reference sequence (RefSeq) database as before. Next, use the Organism field to set taxonomic limits that filter out the highly similar sequences in closely related organisms. Typing in the common name of the group (like sponges) will frequently bring up appropriate suggestions.
Shortcut for this part to avoid ORF Finder steps: Go to the NCBI BLAST homepage https://blast.ncbi.nlm.nih.gov/Blast.cgi and select Protein BLAST. Enter the accession number for human EGFR (NP_005219.2) into the query entry box.
Try the following 4 organism limits,
plus at least two more of your choice.
If you use new tabs for each search, you can run and view them all
simultaneously for comparison. The taxonomy reports and/or
distance trees may help your interpretation.
If your organismal
biology is a little rusty (like mine is), look up some of the scientific
names. You can often get more
information by opening the link to the sequence and opening the taxonomy link. This can also be done from the taxonomy
reports link.
exclude Mammalia
include only arthropods (if you want fruit flies specifically, narrow the organism limit)
include only bacteria (use Bacteria taxid:2, which will include characterized eubacteria; taxid:77133 is uncultured environmental samples, which are not included in the reference sequence database)
include only sponges (Porifera)
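The four organism limits above can also be written as Entrez query terms, which is the form BLAST expects when a search is submitted by URL rather than through the web form. The sketch below only builds the request URLs and does not submit anything. The parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) follow the NCBI BLAST URL API as I understand it, the `all[filter] NOT ...` form for exclusions is an assumption, and the taxonomy IDs other than Bacteria (taxid 2) should be confirmed in the NCBI Taxonomy browser before use.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

# Taxonomy IDs; all but Bacteria are assumptions to verify at NCBI Taxonomy.
TAXIDS = {"Mammalia": 40674, "Arthropoda": 6656, "Bacteria": 2, "Porifera": 6040}

def organism_limit(taxon, exclude=False):
    """Return an Entrez query term that includes or excludes a taxon."""
    term = f"txid{TAXIDS[taxon]}[ORGN]"
    return f"all[filter] NOT {term}" if exclude else term

def blast_submit_url(query, entrez_query):
    """Build (but do not send) a blastp request against RefSeq proteins."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": "refseq_protein",
        "QUERY": query,
        "ENTREZ_QUERY": entrez_query,
    }
    return BLAST_URL + "?" + urlencode(params)

searches = [("Mammalia", True), ("Arthropoda", False),
            ("Bacteria", False), ("Porifera", False)]
for taxon, exclude in searches:
    print(blast_submit_url("NP_005219.2", organism_limit(taxon, exclude)))
```

For this exercise the web form is all you need; the sketch is mainly useful for seeing what the Organism box does behind the scenes.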
Q19. For each search, note the extent (query cover %) and regions of homology; in other words, is the entire sequence homologous, or just a specific domain or region? This is readily done using the Graphic Summary tab and comparing the alignment score figure summaries with the domain map above it. What range of positive matches in the amino acid alignments (in %) do you find for each group? What specific organisms in each of the searches are familiar to you? Which are least related to humans? How do the overall patterns of homology compare to standard phylogenetic/evolutionary trees of organisms? Do any particular regions/domains of the protein appear to be more conserved or evolutionarily more ancient?
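Query cover can be thought of as the percentage of query positions that fall inside at least one aligned segment, with overlapping segments counted only once. A minimal sketch of that calculation follows. The segment coordinates and query length are invented, and BLAST's own bookkeeping may differ in detail.

```python
def query_cover(segments, query_len):
    """Percent of query residues covered by at least one aligned segment.

    segments is a list of (start, end) positions on the query,
    1-based and inclusive. Overlapping segments are counted once.
    """
    covered = 0
    last_end = 0  # highest query position counted so far
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_len

# Two overlapping segments on an invented 300-residue query.
print(query_cover([(1, 100), (51, 150)], 300))  # 50.0
```

A low query cover together with a good E value usually means that only one region of the query, such as a single domain, matched the subject.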
B. Are there additional
genes similar to EGFR in humans?
Mammalian development and cell division are complex, highly regulated processes. Is it possible that genes similar to EGFR evolved and function in the human genome? To test this possibility, modify your BLAST search by limiting the search to only human reference sequences, "Homo sapiens" in the Organism dataset limit. You will want to view these results in the distance tree view as you did before, rectangular or slant view for clarity, select in tools “expand all”, and click on TXT to optimize the view.
Q20. Some of the sequences are semi-redundant, showing precursor sequences and splicing
variants (isoforms in the tree). These highly related sequences are
clustered into discrete branches, most with multiple isoforms listed as a, b, c, or by Roman numerals, with
distinct gene names most obvious (in my opinion) in the rectangular
format. An additional note to help you make sense of this: EGFR also
goes by the name of erbB-1.
List the names of the 4 distinct genes.
These sequences have a special type of homology referred to as paralogous. How might paralogous sequences evolve?
NOTE 8: When this exercise was first designed
in 2001, the amount of sequence data was significantly less, and our distant
relations (including fruit flies and fish) and paralogous sequences were
readily found in the initial BLAST results from OrfFinder
without modification of the BLAST options.
Optimizing BLAST to address specific questions is a useful skill, and we
have touched upon some of these tricks.
The data explosion has also necessitated development of derivative
databases such as Gene that can summarize the essentials and link to the
details, and we’ll make use of some of these later this semester.