Objective To provide comprehensive genetic information for the development of microsatellite markers and the mining of functional genes in Phyllanthus emblica by characterizing the transcriptome of P. emblica in dry-hot valleys in Yunnan.
Method Transcriptome sequencing was conducted on young leaves of P. emblica using the Illumina HiSeq 4000 platform, followed by read filtering, de novo assembly and clustering. The resulting Unigenes were analyzed for sequence similarity and annotated against the NCBI non-redundant (NR) protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), KEGG, SwissProt, PlantTFDB and PRGdb.
Result In total, 10.52 Gb of clean reads were generated, with a Q20 of 98.47% and a Q30 of 95.28%. De novo assembly and clustering of the clean reads yielded 76 881 Unigenes with an average length of 713 nt and an N50 of 1 257 nt. Of these, 44 768 Unigenes were functionally annotated against four protein databases. The Unigenes were divided into 25 categories according to COG function. Based on GO functional annotation, they were grouped into three functional categories (biological process, cellular component and molecular function) and 47 sub-categories. KEGG analysis assigned the Unigenes to six categories and 21 metabolic pathways, of which about 3/5 were Metabolism. A total of 42 953 CDS were detected from the functional annotation results, and a further 2 058 CDS were predicted from the remaining Unigenes using ESTScan. In addition, 56 transcription factor families and 18 resistance genes were predicted.
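As a brief illustration of the assembly statistics reported above, the following minimal sketch shows how an average length and an N50 are conventionally computed from a list of sequence lengths. The function names and the toy lengths are hypothetical and are not taken from this study; the N50 definition used here (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases) is the common convention and is assumed, not confirmed, to be the one applied to the Unigenes.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length; the
    length of the sequence that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean of the sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    return sum(lengths) / len(lengths)


# Toy, made-up lengths in nt (not data from the study).
toy_lengths = [2000, 1500, 1000, 500, 300, 200]
toy_n50 = n50(toy_lengths)
toy_mean = mean_length(toy_lengths)
```

Because N50 weights long sequences more heavily than the mean does, it is typically larger than the average length when the length distribution is skewed toward short sequences, as with the 1 257 nt N50 and 713 nt average reported here.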
Conclusion The transcriptome Unigenes of P. emblica are of high quality and good integrity, and represent abundant genes with diverse functions. They lay an important foundation for further work on functional gene mining, analysis of resistance mechanisms, molecular marker development and molecular-assisted breeding in P. emblica and congeneric species.