Molecule Tutorials - Herong's Tutorial Examples - v1.26, by Herong Yang
RefSeq Proteins of Human Genome
This section provides a tutorial example on how to download the RefSeq Proteins data file, which contains 59,248 human proteins in FASTA format.
What Is RefSeq Proteins of Human Genome? - RefSeq Proteins of Human Genome is a data file in FASTA format that contains about 60,000 human protein sequences provided by NCBI (National Center for Biotechnology Information).
Here is what I did to download the RefSeq Proteins of Human Genome data file.
1. Get the data file with the "curl" command:
herong$ curl ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz > proteins.gz
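Because the command above redirects curl's output with ">", the shell creates "proteins.gz" even when the transfer fails, which can leave an empty or partial file behind. Here is a minimal sketch of a more defensive download. It assumes a hypothetical helper function named "fetch_file" that is not part of the original tutorial. The curl options used are standard ones, but whether the server still offers the file at the URL from step 1 is not verified here.

```shell
# Hypothetical helper (an assumption, not from the tutorial):
# download a URL to a named file and return a non-zero status on failure.
#   $1 = source URL
#   $2 = output file name
fetch_file() {
    # --fail        : return an error on HTTP error responses (mainly matters for HTTP/HTTPS)
    # --location    : follow redirects
    # --silent      : hide the progress meter
    # --show-error  : still print error messages
    # --retry 3     : retry up to 3 times on transient errors
    # --output FILE : write to FILE instead of standard output
    curl --fail --location --silent --show-error --retry 3 --output "$2" "$1"
}

# Example call with the URL from step 1 (commented out; it needs network access):
# fetch_file "ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz" proteins.gz
```

The function's exit status can then be checked before moving on to the unzip step, for example with "fetch_file URL proteins.gz || echo download failed".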
2. Unzip and verify the file.
herong$ gunzip proteins.gz
herong$ head -100 proteins

>NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTVSASLESVRGNRSLFTDLEAE
NDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTVMVKNEDSLVFVQTDKSIYKPGQTVKFRVVSMDENFHPLNE
LIPLVYIQDPKGNRIAQWQSFQLEGGLKQFSFPLSSEPFQGSYKVVVQKKSGGRTEHPFTVEEFVLPKFEVQVTVPKIIT
ILEEEMNVSVCGLYTYGKPVPGHVTVSICRKYSDASDCHGEDSQAFCEKFSGQLNSHGCFYQQVKTKVFQLKRKEYEMKL
HTEAQIQEEGTVVELTGRQSSEITRTITKLSFVKVDSHFRQGIPFFGQVRLVDGKGVPIPNKVIFIRGNEANYYSNATTD
EHGLVQFSINTTNVMGTSLTVRVNYKDRSPCYGYQWVSEEHEEAHHTAYLVFSPSKSFVHLEPMSHELPCGHTQTVQAHY
ILNGGTLLGLKKLSFYYLIMAKGGIVRTGTHGLLVKQEDMKGHFSISIPVKSDIAPVARLLIYAVLPTGDVIGDSAKYDV
ENCLANKVDLSFSPSQSLPASHAHLRVTAAPQSVCALRAVDQSVLLMKPDAELSASSVYNLLPEKDLTGFPGPLNDQDNE
DCINRHNVYINGITYTPVSSTNEKDMYSFLEDMGLKAFTNSKIRKPKMCPQLQQYEMHGPEGLRVGFYESDVMGRGHARL
VHVEEPHTETVRKYFPETWIWDLVVVNSAGVAEVGVTVPDTITEWKAGAFCLSEDAGLGISSTASLRAFQPFFVELTMPY
SVIRGEAFTLKATVLNYLPKCIRVSVQLEASPAFLAVPVEKEQAPHCICANGRQTVSWAVTPKSLGNVNFTVSAEALESQ
ELCGTEVPSVPEHGRKDTVIKPLLVEPEGLEKETTFNSLLCPSGGEVSEELSLKLPPNVVEESARASVSVLGDILGSAMQ
NTQNLLQMPYGCGEQNMVLFAPNIYVLDYLNETQQLTPEIKSKAIGYLNTGYQRQLNYKHYDGSYSTFGERYGRNQGNTW
LTAFVLKTFAQARAYIFIDEAHITQALIWLSQRQKDNGCFRSSGSLLNNAIKGGVEDEVTLSAYITIALLEIPLTVTHPV
VRNALFCLESAWKTAQEGDHGSHVYTKALLAYAFALAGNQDKRKEVLKSLNEEAVKKDNSVHWERPQKPKAPVGHFYEPQ
APSAEVEMTSYVLLAYLTAQPAPTSEDLTSATNIVKWITKQQNAQGGFSSTQDTVVALHALSKYGAATFTRTGKAAQVTI
QSSGTFSSKFQVDNNNRLLLQQVSLPELPGEYSMKVTGEGCVYLQTSLKYNILPEKEEFPFALGVQTLPQTCDEPKAHTS
FQISLSVSYTGSRSASNMAIVDVKMVSGFIPLKPTVKMLERSNHVSRTEVSSNHVLIYLDKVSNQTLSLFFTVLQDVPVR
DLKPAIVKVYDYYETDEFAIAEYNAPCSKDLGNA
>NP_000006.2 arylamine N-acetyltransferase 2 [Homo sapiens]
MDIEAYFERIGYKNSRNKLDLETLTDILEHQIRAVPFENLNMHCGQAMELGLEAIFDHIVRRNRGGWCLQVNQLLYWALT
...
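The verification above relies on reading the first lines by eye. As a complementary check, the archive can be tested and inspected before it is fully decompressed. Here is a minimal sketch that runs on a tiny made-up sample file (the record name and sequence are invented for illustration, not taken from the real data file):

```shell
# Create a small gzip-compressed FASTA sample (made-up data, for illustration only).
workdir=$(mktemp -d)
printf '>NP_TEST1.1 made-up protein [Homo sapiens]\nMGKNKLLHPS\n' > "$workdir/sample.faa"
gzip "$workdir/sample.faa"          # produces sample.faa.gz

# Test archive integrity; "gzip -t" exits non-zero on a corrupt or truncated file.
if gzip -t "$workdir/sample.faa.gz"; then
    integrity=ok
else
    integrity=bad
fi

# Peek at the first line without writing the uncompressed file to disk.
first_line=$(gunzip -c "$workdir/sample.faa.gz" | head -n 1)

# A FASTA file is expected to start with a ">" header line.
case "$first_line" in
    ">"*) looks_like_fasta=yes ;;
    *)    looks_like_fasta=no ;;
esac

echo "integrity=$integrity fasta=$looks_like_fasta"
rm -rf "$workdir"
```

On the real download, the same two checks would be run against "proteins.gz" before the "gunzip" command in step 2.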
3. Count the number of proteins. The first number in the "wc" output is the line count; it shows that 59,248 protein records with "NP_" accession numbers are in the data file.
herong$ grep ">NP_" proteins | wc
   59248  511098 4345961
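The pattern ">NP_" counts only records whose accession number starts with "NP_". RefSeq also uses other protein accession prefixes, such as "XP_" for predicted model proteins. Whether the downloaded file contains such records is an assumption here, not something the output above shows. Here is a minimal sketch, run on a tiny made-up sample file, that counts all FASTA records and breaks them down by accession prefix:

```shell
# Build a tiny sample FASTA file (made-up records, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>NP_TEST1.1 made-up protein one [Homo sapiens]
MGKNKLLHPS
LVLLLLVLLP
>NP_TEST2.1 made-up protein two [Homo sapiens]
MDIEAYFERI
>XP_TEST3.1 made-up predicted protein [Homo sapiens]
MAAAKLLPTD
EOF

# Count every record: each FASTA record starts with ">" in column 1.
total=$(grep -c '^>' "$sample")

# Count only records whose accession starts with "NP_".
curated=$(grep -c '^>NP_' "$sample")

# Break the records down by the two-letter accession prefix (characters 2-3).
by_prefix=$(grep '^>' "$sample" | cut -c2-3 | sort | uniq -c)

echo "total=$total curated=$curated"
echo "$by_prefix"
rm -f "$sample"
```

Anchoring the pattern with "^" makes sure that only header lines are matched, and "grep -c" reports the count directly without the extra word and byte columns printed by "wc".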