SEQUENCE ANALYSIS

ANALYSIS OF DNA AND PROTEIN SEQUENCES USING NCBI RESOURCES

http://webspace.ship.edu/wjpatr/cell_web_lab2.html

DNA fragments obtained by cloning specific genes, or high-throughput next-generation sequencing data from gene expression analysis or metagenomic studies, need to be identified by analysis of the obtained sequences. If a sequence is a coding sequence, i.e., if it encodes a polypeptide, you will also wish to determine structure/function characteristics of the encoded protein. Fortunately, there are freely accessible on-line databases for such analyses. Many of you have used the NCBI (National Center for Biotechnology Information) “GenBank” database in previous courses. Over the years, NCBI has become a vast and highly integrated set of tools and databases for genomes, transcriptomes, protein sequences and structures, with links to the literature and external tools and resources. In this exercise, you will use the DNA sequence below as the basis for a number of exercises related to database searches and analyses on the web, as well as to help reinforce concepts related to gene expression and protein structure. To start, use the mouse to select the text of the sequence below. Once it is highlighted, press Ctrl + C to copy the text to the clipboard. You will then be able to paste the sequence into various programs on the web. Be sure to keep this page open, opening links in new tabs or windows.

NOTE 1: By convention, DNA sequences are published as single strands written left to right, from the 5' end to the 3' end. However, genomes of cells are double-stranded, and both strands get used as templates for transcription. Accordingly, both strands must be analyzed when searching for coding regions.

CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG

CCGGAGTCCCGAGCTAGCCCCGGCGGCCGCCGCCGCCCAGACCGGACGACAGGCCACCTCGTCGGCGTCC

GCCCGAGTCCCCGCCTCGCCGCCAACGCCACAACCACCGCGCACGGCCCCCTGACTCCGTCCAGTATTGA

TCGGGAGAGCCGGAGCGAGCTCTTCGGGGAGCAGCGATGCGACCCTCCGGGACGGCCGGGGCAGCGCTCC

TGGCGCTGCTGGCTGCGCTCTGCCCGGCGAGTCGGGCTCTGGAGGAAAAGAAAGTTTGCCAAGGCACGAG

TAACAAGCTCACGCAGTTGGGCACTTTTGAAGATCATTTTCTCAGCCTCCAGAGGATGTTCAATAACTGT

GAGGTGGTCCTTGGGAATTTGGAAATTACCTATGTGCAGAGGAATTATGATCTTTCCTTCTTAAAGACCA

TCCAGGAGGTGGCTGGTTATGTCCTCATTGCCCTCAACACAGTGGAGCGAATTCCTTTGGAAAACCTGCA

GATCATCAGAGGAAATATGTACTACGAAAATTCCTATGCCTTAGCAGTCTTATCTAACTATGATGCAAAT

AAAACCGGACTGAAGGAGCTGCCCATGAGAAATTTACAGGAAATCCTGCATGGCGCCGTGCGGTTCAGCA

ACAACCCTGCCCTGTGCAACGTGGAGAGCATCCAGTGGCGGGACATAGTCAGCAGTGACTTTCTCAGCAA

CATGTCGATGGACTTCCAGAACCACCTGGGCAGCTGCCAAAAGTGTGATCCAAGCTGTCCCAATGGGAGC

TGCTGGGGTGCAGGAGAGGAGAACTGCCAGAAACTGACCAAAATCATCTGTGCCCAGCAGTGCTCCGGGC

GCTGCCGTGGCAAGTCCCCCAGTGACTGCTGCCACAACCAGTGTGCTGCAGGCTGCACAGGCCCCCGGGA

GAGCGACTGCCTGGTCTGCCGCAAATTCCGAGACGAAGCCACGTGCAAGGACACCTGCCCCCCACTCATG

CTCTACAACCCCACCACGTACCAGATGGATGTGAACCCCGAGGGCAAATACAGCTTTGGTGCCACCTGCG

TGAAGAAGTGTCCCCGTAATTATGTGGTGACAGATCACGGCTCGTGCGTCCGAGCCTGTGGGGCCGACAG

CTATGAGATGGAGGAAGACGGCGTCCGCAAGTGTAAGAAGTGCGAAGGGCCTTGCCGCAAAGTGTGTAAC

GGAATAGGTATTGGTGAATTTAAAGACTCACTCTCCATAAATGCTACGAATATTAAACACTTCAAAAACT

GCACCTCCATCAGTGGCGATCTCCACATCCTGCCGGTGGCATTTAGGGGTGACTCCTTCACACATACTCC

TCCTCTGGATCCACAGGAACTGGATATTCTGAAAACCGTAAAGGAAATCACAGGGTTTTTGCTGATTCAG

GCTTGGCCTGAAAACAGGACGGACCTCCATGCCTTTGAGAACCTAGAAATCATACGCGGCAGGACCAAGC

AACATGGTCAGTTTTCTCTTGCAGTCGTCAGCCTGAACATAACATCCTTGGGATTACGCTCCCTCAAGGA

GATAAGTGATGGAGATGTGATAATTTCAGGAAACAAAAATTTGTGCTATGCAAATACAATAAACTGGAAA

AAACTGTTTGGGACCTCCGGTCAGAAAACCAAAATTATAAGCAACAGAGGTGAAAACAGCTGCAAGGCCA

CAGGCCAGGTCTGCCATGCCTTGTGCTCCCCCGAGGGCTGCTGGGGCCCGGAGCCCAGGGACTGCGTCTC

TTGCCGGAATGTCAGCCGAGGCAGGGAATGCGTGGACAAGTGCAACCTTCTGGAGGGTGAGCCAAGGGAG

TTTGTGGAGAACTCTGAGTGCATACAGTGCCACCCAGAGTGCCTGCCTCAGGCCATGAACATCACCTGCA

CAGGACGGGGACCAGACAACTGTATCCAGTGTGCCCACTACATTGACGGCCCCCACTGCGTCAAGACCTG

CCCGGCAGGAGTCATGGGAGAAAACAACACCCTGGTCTGGAAGTACGCAGACGCCGGCCATGTGTGCCAC

CTGTGCCATCCAAACTGCACCTACGGATGCACTGGGCCAGGTCTTGAAGGCTGTCCAACGAATGGGCCTA

AGATCCCGTCCATCGCCACTGGGATGGTGGGGGCCCTCCTCTTGCTGCTGGTGGTGGCCCTGGGGATCGG

CCTCTTCATGCGAAGGCGCCACATCGTTCGGAAGCGCACGCTGCGGAGGCTGCTGCAGGAGAGGGAGCTT

GTGGAGCCTCTTACACCCAGTGGAGAAGCTCCCAACCAAGCTCTCTTGAGGATCTTGAAGGAAACTGAAT

TCAAAAAGATCAAAGTGCTGGGCTCCGGTGCGTTCGGCACGGTGTATAAGGGACTCTGGATCCCAGAAGG

TGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAA

ATCCTCGATGAAGCCTACGTGATGGCCAGCGTGGACAACCCCCACGTGTGCCGCCTGCTGGGCATCTGCC

TCACCTCCACCGTGCAGCTCATCACGCAGCTCATGCCCTTCGGCTGCCTCCTGGACTATGTCCGGGAACA

CAAAGACAATATTGGCTCCCAGTACCTGCTCAACTGGTGTGTGCAGATCGCAAAGGGCATGAACTACTTG

GAGGACCGTCGCTTGGTGCACCGCGACCTGGCAGCCAGGAACGTACTGGTGAAAACACCGCAGCATGTCA

AGATCACAGATTTTGGGCTGGCCAAACTGCTGGGTGCGGAAGAGAAAGAATACCATGCAGAAGGAGGCAA

AGTGCCTATCAAGTGGATGGCATTGGAATCAATTTTACACAGAATCTATACCCACCAGAGTGATGTCTGG

AGCTACGGGGTGACCGTTTGGGAGTTGATGACCTTTGGATCCAAGCCATATGACGGAATCCCTGCCAGCG

AGATCTCCTCCATCCTGGAGAAAGGAGAACGCCTCCCTCAGCCACCCATATGTACCATCGATGTCTACAT

GATCATGGTCAAGTGCTGGATGATAGACGCAGATAGTCGCCCAAAGTTCCGTGAGTTGATCATCGAATTC

TCCAAAATGGCCCGAGACCCCCAGCGCTACCTTGTCATTCAGGGGGATGAAAGAATGCATTTGCCAAGTC

CTACAGACTCCAACTTCTACCGTGCCCTGATGGATGAAGAAGACATGGACGACGTGGTGGATGCCGACGA

GTACCTCATCCCACAGCAGGGCTTCTTCAGCAGCCCCTCCACGTCACGGACTCCCCTCCTGAGCTCTCTG

AGTGCAACCAGCAACAATTCCACCGTGGCTTGCATTGATAGAAATGGGCTGCAAAGCTGTCCCATCAAGG

AAGACAGCTTCTTGCAGCGATACAGCTCAGACCCCACAGGCGCCTTGACTGAGGACAGCATAGACGACAC

CTTCCTCCCAGTGCCTGAATACATAAACCAGTCCGTTCCCAAAAGGCCCGCTGGCTCTGTGCAGAATCCT

GTCTATCACAATCAGCCTCTGAACCCCGCGCCCAGCAGAGACCCACACTACCAGGACCCCCACAGCACTG

CAGTGGGCAACCCCGAGTATCTCAACACTGTCCAGCCCACCTGTGTCAACAGCACATTCGACAGCCCTGC

CCACTGGGCCCAGAAAGGCAGCCACCAAATTAGCCTGGACAACCCTGACTACCAGCAGGACTTCTTTCCC

AAGGAAGCCAAGCCAAATGGCATCTTTAAGGGCTCCACAGCTGAAAATGCAGAATACCTAAGGGTCGCGC

CACAAAGCAGTGAATTTATTGGAGCATGACCACGGAGGATAGTATGAGCCCTAAAAATCCAGACTCTTTC

GATACCCAGGACCAAGCCACAGCAGGTCCTCCATCCCAACAGCCATGCCCGCATTAGCTCTTAGACCCAC

AGACTGGTTTTGCAACGTTTACACCGACTAGCCAGGAAGTACTTCCACCTCGGGCACATTTTGGGAAGTT

GCATTCCTTTGTCTTCAAACTGTGAAGCATTTACAGAAACGCATCCAGCAAGAATATTGTCCCTTTGAGC

AGAAATTTATCTTTCAAAGAGGTATATTTGAAAAAAAAAAAAAGTATATGTGAGGATTTTTATTGATTGG

GGATCTTGGAGTTTTTCATTGTCGCTATTGATTTTTACTTCAATGGGCTCTTCCAACAAGGAAGAAGCTT

GCTGGTAGCACTTGCTACCCTGAGTTCATCCAGGCCCAACTGTGAGCAAGGAGCACAAGCCACAAGTCTT

CCAGAGGATGCTTGATTCCAGTGGTTCTGCTTCAAGGCTTCCACTGCAAAACACTAAAGATCCAAGAAGG

CCTTCATGGCCCCAGCAGGCCGGATCGGTACTGTATCAAGTCATGGCAGGTACAGTAGGATAAGCCACTC

TGTCCCTTCCTGGGCAAAGAAGAAACGGAGGGGATGGAATTCTTCCTTAGACTTACTTTTGTAAAAATGT

CCCCACGGTACTTACTCCCCACTGATGGACCAGTGGTTTCCAGTCATGAGCGTTAGACTGACTTGTTTGT

CTTCCATTCCATTGTTTTGAAACTCAGTATGCTGCCCCTGTCTTGCTGTCATGAAATCAGCAAGAGAGGA

TGACACATCAAATAATAACTCGGATTCCAGCCCACATTGGATTCATCAGCATTTGGACCAATAGCCCACA

GCTGAGAATGTGGAATACCTAAGGATAGCACCGCTTTTGTTCTCGCAAAAACGTATCTCCTAATTTGAGG

CTCAGATGAAATGCATCAGGTCCTTTGGGGCATAGATCAGAAGACTACAAAAATGAAGCTGCTCTGAAAT

CTCCTTTAGCCATCACCCCAACCCCCCAAAATTAGTTTGTGTTACTTATGGAAGATAGTTTTCTCCTTTT

ACTTCACTTCAAAAGCTTTTTACTCAAAGAGTATATGTTCCCTCCAGGTCAGCTGCCCCCAAACCCCCTC

CTTACGCTTTGTCACACAAAAAGTGTCTCTGCCTTGAGTCATCTATTCAAGCACTTACAGCTCTGGCCAC

AACAGGGCATTTTACAGGTGCGAATGACAGTAGCATTATGAGTAGTGTGGAATTCAGGTAGTAAATATGA

AACTAGGGTTTGAAATTGATAATGCTTTCACAACATTTGCAGATGTTTTAGAAGGAAAAAAGTTCCTTCC

TAAAATAATTTCTCTACAATTGGAAGATTGGAAGATTCAGCTAGTTAGGAGCCCACCTTTTTTCCTAATC

TGTGTGTGCCCTGTAACCTGACTGGTTAACAGCAGTCCTTTGTAAACAGTGTTTTAAACTCTCCTAGTCA

ATATCCACCCCATCCAATTTATCAAGGAAGAAATGGTTCAGAAAATATTTTCAGCCTACAGTTATGTTCA

GTCACACACACATACAAAATGTTCCTTTTGCTTTTAAAGTAATTTTTGACTCCCAGATCAGTCAGAGCCC

CTACAGCATTGTTAAGAAAGTATTTGATTTTTGTCTCAATGAAAATAAAACTATATTCATTTCCACTCTA

AAAAAAAAAAAAAAAA

Once you have selected and copied the text, you will use some tools maintained at the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) to analyze this sequence. The first program you will use is called ORF Finder, which searches for open reading frames (ORFs) in your sequence. The link to ORF Finder can be found in the resource links from the NCBI homepage, but I provide a direct link below.

NCBI has links to many tools including search engines for sequences and structures, and PubMed, which allows access to millions of citations, abstracts and links to the journal articles.

I. Open Reading Frames (ORFs) and the nature of the genetic code

When a DNA sequence is obtained, we frequently want to know if the sequence codes for a protein. To do this, we look for start codons (ATG) on each strand of the DNA, and then read both strands of the sequence 5' to 3' to see if triplet codons continue in frame without reaching a stop codon. If they continue, they constitute an open reading frame (ORF), which could potentially be a protein-coding sequence.
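The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's ORF Finder code: the function names and the `min_codons` cutoff are assumptions made for this example, it starts only at ATG, and it reports positions on the strand being read (so - strand positions count from the 5' end of the reverse complement).

```python
# Minimal six-frame ORF scan (illustrative sketch, not ORF Finder itself).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Find ATG...stop ORFs in all six reading frames.

    Returns tuples (strand, frame, start, end, length_nt), longest first.
    start and end are 1-based positions on the strand being read, and
    length_nt includes the stop codon. min_codons (a hypothetical cutoff)
    also counts the stop codon.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the frame
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start + 1, i + 3, length))
                    start = None  # look for the next ORF in this frame
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATGGTAGCC"):
        print(orf)
```

Running it on the toy string in the last lines finds a single ORF in frame +3; pasting in the full sequence from this page should give a list comparable to, though not necessarily identical with, the ORF Finder table.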

Connect to the Open Reading Frame (ORF) Finder program link. (From the NCBI homepage, click on the “Analyze” heading in the middle of the page, and click on the "All Tools" list. ORF Finder will be just past midway down the page, listed alphabetically.)

To run the program, paste (Ctrl + V) the sequence into the large data input window that’s labeled "sequence in FASTA format". Click on the "Submit" button below the window. Within a minute or two an ORF viewer window will appear on top showing the ORFs as red lines with little arrows indicating the direction of the ORF (right arrow for the + strand, left arrow for the - strand).

Below, on the right will be the ORFs listed in order of their length in nucleotides and amino acids. Usually, the longest ORF is the true one. The list will also indicate if the sense strand for the ORF is the + strand (left to right as written) or the - strand (the complement, reading right to left), and show the start and stop positions on the nucleotide strand. Below left is a window showing the sequence of the selected ORF. The “Display ORF as” above the window allows selection of protein only, nucleotide only, or both (CDS translation).
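The relationship between the two strands' numbering can be made concrete with a small sketch. It assumes (this is my assumption for the example, not a statement about ORF Finder's internals) that an ORF was located by reading the reverse complement 5' to 3', and that we want its start and stop expressed in the numbering of the sequence as written. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def minus_to_plus_coords(start, end, seq_len):
    """Convert 1-based positions on the reverse complement into the
    numbering of the original (+) strand."""
    return seq_len - start + 1, seq_len - end + 1


if __name__ == "__main__":
    dna = "GGCTATTTCATCC"             # toy sequence, 13 nt
    minus = reverse_complement(dna)   # GGATGAAATAGCC
    # On the - strand the ORF ATG AAA TAG occupies positions 3..11.
    print(minus[2:11])
    # In + strand numbering the same ORF starts at 11 and ends at 3.
    print(minus_to_plus_coords(3, 11, len(dna)))
```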

Click on the longest blue ORF (reading frame +1) and examine the sequence that opens on the left in CDS translation view.

NOTE 2: Below the data input window of OrfFinder is a menu to choose the genetic code. The genetic code for nuclear-encoded genes is “standard” for plants and animals, but it varies somewhat for mitochondrial genes and the nuclear-encoded genes of some protists. Alternative start codons, such as GUG in bacteria, can also be taken into account. These codon variants are not extreme, and a common ancestry of all extant genetic codes is still a reasonable hypothesis.
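To see how small these variations are, here is a hedged sketch comparing the standard code with the vertebrate mitochondrial code, which differs at only four codons (TGA, ATA, AGA, AGG). The table strings follow the usual TCAG codon ordering; the `translate` helper is a simplified stand-in written for this example, not the routine ORF Finder uses.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Standard genetic code, one letter per codon in TCAG order ("*" = stop).
STANDARD = dict(zip(
    CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

# Vertebrate mitochondrial code: the same table with four reassignments.
VERT_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    toy = "ATGATATGAAGATAA"  # made-up sequence for illustration
    print(translate(toy, STANDARD))   # TGA read as stop
    print(translate(toy, VERT_MITO))  # TGA read as Trp, AGA as stop
```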

Q1. How long is the protein (number of amino acids)? At which base pair (number in the DNA sequence) does the ORF start? At which base pair (number in the DNA sequence) does the ORF end? What stop codon is used? What are the first and last amino acids in the sequence? Note some of this information is summarized in the table to the right of the sequence.

Now click on the third largest ORF (ORF32) in the -2 reading frame and examine the sequence.

Q2. How long is the protein? At which base pair (number in the DNA sequence) does the ORF start? At which base pair (number in the DNA sequence) does the ORF end? What stop codon is used? What are the first and last amino acids in the sequence? Why does the DNA sequence count down rather than up as in the case above? (The sequence viewer at the top of the ORF Finder page may help.)

Q3. Explain why OrfFinder needs to analyze six different reading frames (+1, +2, +3 and -1, -2, -3). Refer to your answers in the previous question to help answer this one. (Additional hints: DNA is double-stranded; a triplet genetic code.)

II. Search for and view similar protein sequences using Protein BLAST*

OrfFinder results can be directly linked to BLAST searches. BLAST (Basic Local Alignment Search Tool) is a program that searches sequences at GenBank for matches to an input sequence. There are a number of types of BLAST searches, including searches that look for matching nucleotide sequences, protein sequences, and protein sequences to DNA sequences "translated" by the BLAST program, as well as specialized programs to design PCR primers for a nucleic acid sequence of interest.

Since we are working with a translated DNA sequence (i.e., a polypeptide sequence) from OrfFinder results, we will use the default protein BLAST (blastp) from the site, which compares your input protein sequence (the translated ORF) with a database of protein sequences, most of them "translated" from coding genes. The BLAST algorithm looks for matches between your “query” sequence and subject sequences and creates alignments between them.
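To give a feel for what "creating an alignment" means, here is a sketch of the classic Smith-Waterman local alignment score. Note the swap: this is exhaustive dynamic programming, not the faster word-seeded heuristic BLAST actually uses, and the match/mismatch/gap values are arbitrary assumptions rather than a real protein scoring matrix.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman).

    Scores never drop below zero, so the best-scoring stretch can start
    and end anywhere in either sequence.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Toy peptides: only the shared stretch RPSG contributes to the score.
    print(local_alignment_score("MRPSGTAG", "XXRPSGXX"))
```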

*NOTE 3: Sequence databases are very large, and are increasing at a high rate. The BLAST button from OrfFinder uses the smallest database, SwissProt, as a default. The other choices include RefSeq protein, a GenBank database filtered to contain only verified results, and “nr”, an abbreviation for non-redundant, which is a misnomer since it contains multiple combined databases. Your initial BLAST, using default settings with the RefSeq database, will give results covering a large range of closely related sequences and will serve to identify your protein. We will later make use of limits that will allow us to explore some specific relationships that are not otherwise obvious.

A. Initial BLAST – identify the sequence

To run BLAST on your protein sequence, again select ORF1. Choose the protein reference sequence (RefSeq) database (not the default SwissProt) and click on the BLAST button below the sequence window (NOT the Smart BLAST). This will open a new BLAST entry page. Keep the default parameters for blastp. [*Double-check the database on the entry page.] Select "show results in a new window" and click on the BLAST button near the bottom of the page. The program may take a few minutes to run, with the results appearing in a new tab. Don’t close the BLAST entry page; you will rerun it with different parameters to facilitate finding information regarding relationships to homologous sequences in other organisms and in humans.

*The format of the BLAST output has recently changed, mostly for the better. The results are split up into multiple tabs, highlighted in the screenshot below. The page will initially open to the Descriptions tab, which gives you a list of matches (proteins in the database), listed in order of total score. Also listed are the query cover (the % of amino acids in the query that aligned) and the E value.

The Descriptions tab is a list of “hits”, with the best match at the top. The list gives the name of the protein, and often the organism. You will also see score and E values. The higher the score, the better the match; the smaller the E value, the less likely it is that the match arose just by chance. The E values are particularly dependent on the length of your sequence and the database size.
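The dependence on sequence length and database size can be written down with the standard relation between bit score and expected number of chance hits, E = m × n × 2^(-S'), where m is the query length, n is the total number of residues in the database, and S' is the bit score. The sketch below is a simplification: real BLAST reports use "effective" lengths with edge corrections, which are ignored here, and the function name and numbers are made up for illustration.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'): expected number of chance alignments scoring
    at least bit_score in a search space of query_len * db_len."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same bit score, two hypothetical database sizes (total residues).
    for db_len in (10**8, 10**10):
        print(db_len, expected_chance_hits(50.0, 300, db_len))
```

For a fixed bit score, E scales directly with the search space m × n. Keep in mind that a short query also limits the highest score an alignment can reach, so the two effects have to be considered together.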

Q4. Look at the list of matches. What is the name of the protein? In what organism is the best match found? (Use the top several listings, which have the best matches.) Note that the best match has 100% identity. What are the next several organisms? Taxonomy details will be examined later; you may recognize these well-known primates, but if not, click on the scientific name.

Q5. Describe what the E value signifies.* If your query sequence was quite short, how would the E value be affected in a BLAST search? Why?

*NOTE 4: You can get more information about E values and the BLAST algorithms by opening the links at the top of the page in a new window for videos and for how to read this report. There are 2 short and good videos on E values (the first one provides sufficient background). (As of 10/11/22, the above link does work. Below is an alternative link to part one: https://www.youtube.com/watch?v=ZN3RrXAe0uM)

Do not close the tab/window with these BLAST results - you will be using the links from it for the next several questions.

B. Examine alignments

To view pairwise alignments of the sequences and to identify differences, click the check box at the top left of the list to select all of the sequences (there are times you will limit the alignments to simplify and speed up your analysis). Click on the Alignments tab. Just under the tabs, change the alignment view to “pairwise with dots for identities”, then scroll down to examine some of the sequences. The query is the input sequence (top line) and the subject is the match found in the database. The dots indicate identical amino acids. Differences are indicated by red letters.
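The "dots for identities" display is easy to mimic, which may help in reading it. The sketch below assumes the two sequences are already aligned to the same length; the function name and the toy peptides are invented for this example.

```python
def dots_for_identities(query, subject):
    """Rewrite the subject line with '.' wherever it matches the query."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        "." if q == s and q != "-" else s
        for q, s in zip(query, subject)
    )


if __name__ == "__main__":
    query = "MRPSGTAG"
    subject = "MRPSATAG"  # made-up subject with one substitution
    print("Query  ", query)
    print("Sbjct  ", dots_for_identities(query, subject))
```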

Q6. Examine the alignment with the first match. Are all the amino acids identical between the query and the subject (identified as NP_005219.2)?

Q7. Examine the second alignment, which is for Pan troglodytes (chimpanzee) (XP_519102.3). How many amino acids differ between the two sequences? Would you expect substituting an Ile residue with Leu, or a Ser with a Thr, to have a major effect on the overall protein structure and function? Why or why not? Note that at the top of the alignment, the number of “identities” and the number of “positives” are shown. Which of the substitutions in the alignment is not a “positive”? Why? Is it surprising that the chimpanzee sequence is the second-best match?
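As a rough illustration of how “identities” and “positives” are tallied: a position counts as a positive when the two residues are identical or their substitution score is greater than zero. The sketch below uses only a handful of BLOSUM62 entries typed in by hand (an excerpt, not the full matrix), treats any pair missing from the excerpt as non-positive, and assumes an ungapped alignment; all names and the toy peptides are hypothetical.

```python
# Hand-typed excerpt of BLOSUM62 scores for a few residue pairs.
BLOSUM62_EXCERPT = {
    frozenset("IL"): 2,
    frozenset("ST"): 1,
    frozenset("DE"): 2,
    frozenset("KR"): 2,
    frozenset("AG"): 0,
    frozenset("GW"): -2,
}


def identities_and_positives(query, subject):
    """Count identical positions and positive-scoring positions.

    Pairs absent from the excerpt are treated as non-positive, which is
    an assumption of this sketch, not a property of BLOSUM62.
    """
    identities = positives = 0
    for q, s in zip(query, subject):
        if q == s:
            identities += 1
            positives += 1
        elif BLOSUM62_EXCERPT.get(frozenset((q, s)), -1) > 0:
            positives += 1
    return identities, positives


if __name__ == "__main__":
    print(identities_and_positives("MILSAK", "MLLTGR"))
```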

If you scroll down the description list, what other organisms do you recognize? Unless you just took Vert Zoo, probably not many!

C. Examine taxonomic Relationships

1. Lists based on homology

To examine taxonomic/evolutionary relationships of your "hits" more conveniently, results can be displayed as taxonomy reports and as tree diagrams. Click the Taxonomy tab and examine the default Lineage view.
The Taxonomy report produces a table-like listing with the broadest taxonomic categories toward the left, and narrower groupings, like primates and rodents, further to the right, all under the broadest category umbrella of Eutheria (placental mammals). Organisms listed furthest to the left are usually less closely related taxonomically to our human sequence. If you click on Organism, rather than Lineage, the hits are grouped by organism and include the common name, which is handy if you haven't run into a Loxodonta africana recently (you would probably get hurt if you did).

Q8. What is the name given for the protein for all the matches? What taxonomic group, as a whole, has the sequences most similar to human? What is one of the least human-like organisms (taxonomically) listed with an orthologous sequence? Would you say that this protein was evolutionarily conserved? Why or why not?

Q9. How many human matches do you find (list by organism)? How are these different sequences distinguished in their names? Note that this type of terminology (isoform) is typical for normal protein variants encoded by one gene, NOT different genes. How do you think these isoforms arise?

2. Tree Diagrams*

Distance Tree of Results creates a cladogram-like tree diagram grouping "hits" by their sequence similarities, into various branches. The distances between the branches (length of the lines), and between the leaves on the branches, are proportional to their differences in sequence. The branches themselves are largely organized by taxonomy. The link for the distance tree is just above the Graphic Summary tab. Clicking on the link (make sure all sequences are still selected) will open the tree in a new tab.
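The idea that branch lengths reflect sequence differences can be illustrated with the simplest possible distance, the fraction of aligned positions that differ (the p-distance). This is only a sketch: NCBI's tree uses its own distance model and tree-building method, and the three short "sequences" below are made-up strings, not real EGFR fragments.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def closest_pair(aligned):
    """Return the two names whose sequences are most similar."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


if __name__ == "__main__":
    toy = {
        "human": "MRPSGTAGAA",
        "chimp": "MRPSGTAGAV",
        "mouse": "MRPSGTARTT",
    }
    for a, b in combinations(sorted(toy), 2):
        print(a, b, p_distance(toy[a], toy[b]))
    print("closest:", closest_pair(toy))
```

A tree-building method would join the closest pair first and then attach the remaining sequence at a greater distance, which is the grouping pattern to look for in the distance tree.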

Click on the link for Distance Tree of Results. From the tools menu, try the 4 different views: rectangle view, slant view, radial and force. Make use of the color coding shown to the right to help you find different taxonomic groups. I personally find "slant view" or "rectangular view" to be more easily interpreted. Only the sequences with a relatively high match will be presented. Controls on the right will allow you to alter the view. The default is a collapsed view, which is sufficient for this part of our exercise. To see detailed labels for everything, choose “expand all” from the tools, and click on TXT to optimize the view. You can likewise change the sequence label setting between sequence title and taxonomic name to more easily identify sequences and organisms. If the tree is too big to get the “big picture”, download the tree as a PDF file, an option from the tools menu. Viewing it in Adobe Reader or a separate browser tab may be easier for you.

Q10. The query sequence is highlighted in yellow in the tree. What else is grouped with it on the branch?

NOTE 5: The trees created using the pairwise alignments of BLAST are not used for publication grade phylogenetic trees. The BLAST-based trees are good for a first approximation. Publication grade trees are derived from an alignment of all the sequences together (a multiple alignment) instead of just pairwise alignments. A free and reasonably easy-to-use program called MEGA (currently MEGA 11) is available for Windows and MacOS for anyone interested in creating sequence-based phylogenetic trees (https://www.megasoftware.net/).

D. Conserved functional units of 3D structure - Conserved Domains

Select Graphic Summary from the BLAST results page. On top, a map showing detected conserved domains in the polypeptide will be shown. Domains are functional substructures within an overall polypeptide structure. Conserved domain search algorithms are automatically run with protein BLASTs and are actually faster than the alignment determination. Find information regarding the functions of these domains by opening the image as a link.

If you want to see the 3D structures, you can download the stand-alone program Cn3D for PC or Mac and use the menu to select Style>Rendering Shortcuts>Worms.

Q11. Use the descriptions and links on the page to describe the characteristics and roles of the following regions. In addition, write down the range of amino acid residues corresponding to these regions in your query sequence. Finally, indicate whether you think these domains are likely to be intracellular or extracellular, based on their functions. For the Receptor L domain, describe the secondary structure.

PTKc-EGFR domain

Transmembrane domain

Furin-like repeats

Receptor L domain

Growth factor receptor domain IV

NOTE 6: Domain descriptions and assignments do change over time. The usual numbering of the EGFR domains from N terminus to C terminus is: I, the first L domain; II, the first furin-like repeat; III, the second L domain; and IV, the second furin-like repeat (II and IV are sometimes counted together as 2 furin-like repeat domains). More recently described is the growth factor receptor domain IV, linking furin-like repeats to L domains.

III. Reference Sequence (RefSeq) - an integrated and reliable sequence database

NCBI hosts a large number of sequence databases. The RefSeq databases for proteins and nucleic acid sequences are among the most reliable. Their entries are reviewed and annotated, and each summarizes a compilation of multiple entries. In contrast, the nr (“non-redundant”) database used in the default BLAST is in fact highly redundant and contains many partial sequences, essentially identical sequences, and sometimes misidentified sequences. Other databases, such as environmental or “metagenomic” samples, are even more redundant and unfiltered. We will use the RefSeq entry for your protein sequence as a gateway to more information regarding the protein. Most RefSeq accession numbers begin with NP_ or XP_ (proteins), NC_ (genomic sequences), NM_ or XM_ (mRNA/cDNA), or NR_ (non-coding RNA).
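A small lookup makes the accession prefixes easier to remember. The sketch below follows NCBI's published prefix meanings (in which NR_ marks non-coding RNA and NC_ marks complete genomic molecules); it covers only the prefixes mentioned here, and the function name is an assumption made for this example.

```python
REFSEQ_PREFIXES = {
    "NP_": "protein (curated record)",
    "XP_": "protein (predicted model)",
    "NM_": "mRNA (curated record)",
    "XM_": "mRNA (predicted model)",
    "NR_": "non-coding RNA",
    "NC_": "complete genomic molecule",
}


def classify_accession(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "not a listed RefSeq prefix")


if __name__ == "__main__":
    for acc in ("NP_005219.2", "XP_519102.3", "NM_005228.5"):
        print(acc, "->", classify_accession(acc))
```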

Back to your BLAST results: open the link for the entry (at or near the top), ref|NP_005219.2|, in a new window to examine the GenBank entry. (If you have lost your BLAST results page, you can paste the accession number NP_005219.2 into an NCBI search and open the protein link.) Once you have opened this page, you can close the previous BLAST pages. You will want to keep this page open, since it will be used to access additional links.

As you scroll down the page you will see links to related PubMed entries, mostly primary source journal articles. To the side you will see many links: tools including BLAST; related sequences; 3D structures; resources such as OMIM, Gene, and Conserved Domain Database (CDD).

Q12. Read the summary just below the list of PubMed entries. In what specific location in a mammalian cell will this protein be found? What does this receptor protein bind? What does activation of this receptor protein cause in cells? With what disease have mutations been associated?

As you continue to scroll down, you will find a detailed list of structural features of the polypeptide sequence (located near the bottom of the page). The list includes regions such as the signal sequence, domains, disulfide bonds, glycosylation sites, and phosphorylation sites. These descriptions run, roughly, from N to C terminus, from top to bottom. Clicking the feature link will highlight the corresponding sequence and open a pop-up with some details.

Q13. In the descriptions, click on the following (listed from top to bottom, N-terminus to C-terminus), and describe the information for the sequences that are highlighted:

“sig-peptide” – role in translating proteins of endomembrane system?

“mat-peptide” – what happened to the signal peptide?

“Region” 634..674 – how would you describe the properties of amino acids 650-668 in this region? Why is this to be expected?

“Region” 704..1016 – would you expect this region to be intracellular or extracellular?

“Site” 998 – how does this relate to the region described above?
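One way to put a number on the "properties" of a stretch of amino acids is its average hydropathy. The sketch below uses the Kyte-Doolittle scale (positive values are hydrophobic, negative values hydrophilic). The function names are my own and the two test peptides are invented; to apply it to your protein, paste in the residues for the region of interest from the RefSeq page.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Average Kyte-Doolittle value over a peptide."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def window_means(protein, window=19):
    """Mean hydropathy for every window of the given length."""
    return [
        mean_hydropathy(protein[i:i + window])
        for i in range(len(protein) - window + 1)
    ]


if __name__ == "__main__":
    print(mean_hydropathy("LLLLVVVV"))  # hydrophobic toy peptide
    print(mean_hydropathy("KKDDEE"))    # charged toy peptide
```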

IV. Back to nucleic acids - How most protein sequences are determined

Go back to the top of the protein reference sequence entry and click the link for DBSOURCE REFSEQ: accession NM_005228.5 near the top of the page. This is the actual source of the amino acid sequence, translated using a program similar to ORF finder. It is also a Reference Sequence (RefSeq), which means it is a reviewed and highly annotated entry.

In the top portion of the entry you will find a summary of information.

Q14. What is the name (definition) of the entry?

Q15. How many base pairs long is the reported nucleotide sequence?

Q16. Why does it say mRNA at the top of the entry? Does the sequence at the bottom of the page read like RNA or DNA? Hint: Note the 3' end of the sequence at the top of this page that you used for ORF Finder. That sequence corresponds to NM_005228.3, an earlier version of the RefSeq sequence.

Q17. For most eukaryotic gene sequences, in contrast to mRNA sequences such as the one for EGFR that we analyzed, OrfFinder is of limited use: a long, continuous open reading frame such as the one we obtained with the mRNA sequence is extremely rare using actual genomic DNA sequences. WHY?

To obtain sequences of mRNA, the mRNA is typically used as a template to synthesize cDNA (complementary DNA) using a reverse transcriptase. The cDNA can be cloned and sequenced, or sequenced directly using newer high-throughput techniques (referred to as RNA-Seq).
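The base-pairing logic of cDNA synthesis can be sketched as follows. The first cDNA strand is the reverse complement of the mRNA, written in DNA letters; the second strand is the reverse complement of the first. The function names and the short mRNA are hypothetical, and nothing about the enzymology (primers, RNase H, and so on) is modeled.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


def second_strand_cdna(first_strand):
    """DNA strand complementary to the first cDNA strand, written 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first_strand.upper()))


if __name__ == "__main__":
    mrna = "AUGGCC"  # toy mRNA
    first = first_strand_cdna(mrna)
    print(first)
    print(second_strand_cdna(first))
```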

NOTE 7: RefSeq database entries are crucial for reasonably assured sequence comparisons using BLAST and other tools, and have numerous links to related information. An even more comprehensive derivative database for the more common organisms is the Gene database. Although we are not making use of it in this exercise (there are links from both the protein and nucleic acid RefSeq pages), it is very useful as a “one-stop-shop” for information on specific genes and their corresponding transcripts and proteins.

V. Modifying BLAST searches for more focused queries

A. Are EGFR homologs present in all types of cells, or just “higher” multicellular organisms?

Our initial BLAST search provided a direct answer to the question: What is the identity of our protein sequence? We also found that the protein occurs in a wide variety of placental mammals. Examination of some of the roles reported for the gene product suggests significance in the regulation of cell division, leading us to pose the following questions: Are EGFR homologs present in all types of cells, or just “higher” multicellular organisms? How about plants, fungi, unicellular protists, and bacteria?

Q18. What are your hypotheses regarding these organisms? What is your rationale for your hypotheses – in other words, why do you think they are reasonable?

To test these hypotheses, BLAST the protein sequence, limiting the search to the reference sequence (RefSeq) database as before. Next, limit Organism using some taxonomic limits to filter out the highly similar sequences in closely related organisms. Frequently, typing in the common name of the group (like sponges) will bring up appropriate suggestions.

Shortcut for this part to avoid ORF Finder steps: Go to the NCBI BLAST homepage https://blast.ncbi.nlm.nih.gov/Blast.cgi and select Protein BLAST. Enter the accession number for human EGFR (NP_005219.2) into the query entry box.

Try the following 4 organism limits, plus at least two more of your choice. If you use new tabs for each search, you can run and view them all simultaneously for comparison. The taxonomy reports and/or distance trees may help your interpretation. If your organismal biology is a little rusty (like mine is), look up some of the scientific names. You can often get more information by opening the link to the sequence and opening the taxonomy link. This can also be done from the taxonomy reports link.

exclude mammalia

include only arthropods (if you want fruit flies specifically, narrow the organism limit)

include only bacteria (use Bacteria taxid:2; this will include characterized eubacteria; taxid:77133 is uncultured environmental samples, which are not included in the reference sequence database)

include only sponges (Porifera)
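For anyone who would rather script these four searches than click through them, the organism limits can be written as Entrez-style query terms. The sketch below only builds the strings and a submission URL; it does not contact NCBI. Several things are assumptions to verify before relying on it: the taxids other than Bacteria (taxid:2, given above) are quoted from memory of the NCBI Taxonomy database, `refseq_protein` is the database name I expect for RefSeq protein, and passing the limit through an `ENTREZ_QUERY` parameter follows my reading of the BLAST URL API rather than anything in this exercise.

```python
from urllib.parse import urlencode

# Taxonomy IDs; check them in the NCBI Taxonomy browser before use.
TAXIDS = {
    "Mammalia": 40674,
    "Arthropoda": 6656,
    "Bacteria": 2,
    "Porifera": 6040,
}


def organism_filter(taxon, exclude=False):
    """Build an Entrez-style organism limit for one taxon."""
    term = f"txid{TAXIDS[taxon]}[ORGN]"
    return f"all[filter] NOT {term}" if exclude else term


def blast_put_url(accession, entrez_query):
    """Assemble (but do not send) a blastp submission URL."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": "refseq_protein",
        "QUERY": accession,
        "ENTREZ_QUERY": entrez_query,
    }
    return "https://blast.ncbi.nlm.nih.gov/Blast.cgi?" + urlencode(params)


if __name__ == "__main__":
    limits = [
        organism_filter("Mammalia", exclude=True),
        organism_filter("Arthropoda"),
        organism_filter("Bacteria"),
        organism_filter("Porifera"),
    ]
    for limit in limits:
        print(blast_put_url("NP_005219.2", limit))
```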

Q19. For each search, note the extent (query cover %) and regions of homology; in other words, is the entire sequence homologous, or just a specific domain or region? This is readily done using the graphic summary tab and comparing the alignment score figure summaries with the domain map above it. What range of positive matches in the amino acid alignments (in %) do you find for each group? What specific organisms in each of the searches are familiar to you? Which are least related to humans? How do the overall patterns of homology compare to standard phylogenetic/evolutionary trees of organisms? Do any particular regions/domains of the protein appear to be more conserved or evolutionarily more ancient?

B. Are there additional genes similar to EGFR in humans?

Mammalian development and cell division are complex, highly regulated processes. Is it possible that genes similar to EGFR evolved and function in the human genome? To test this possibility, modify your BLAST search by limiting the search to only human reference sequences, "Homo sapiens" in the Organism dataset limit. View these results in the distance tree as you did before: use rectangular or slant view for clarity, select “expand all” in tools, and click on TXT to optimize the view.

Q20. Some of the sequences are semi-redundant, showing precursor sequences and splicing variants (isoforms in the tree). These highly related sequences are clustered into discrete branches, most with multiple isoforms listed as a, b, c, or by Roman numerals, with distinct gene names most obvious (in my opinion) in the rectangular format. An additional note to help you make sense of this: EGFR also goes by the name of erbB-1.

List the names of the 4 distinct genes.

These sequences have a special type of homology referred to as paralogous. How might paralogous sequences evolve?

NOTE 8: When this exercise was first designed in 2001, the amount of sequence data was significantly less, and our distant relations (including fruit flies and fish) and paralogous sequences were readily found in the initial BLAST results from OrfFinder without modification of the BLAST options. Optimizing BLAST to address specific questions is a useful skill, and we have touched upon some of these tricks. The data explosion has also necessitated development of derivative databases such as Gene that can summarize the essentials and link to the details, and we’ll make use of some of these later this semester.