The Database of Genotypes and Phenotypes (dbGaP) was developed by the National Center for Biotechnology Information (NCBI) to store and distribute the results of studies examining the interaction of genotype and phenotype. It is a public repository of individual-level data on phenotype, exposure, genotype, and sequence, and of the associations between them. Searching accurately and comprehensively for studies of particular interest is challenging because the dbGaP Entrez system relies on keyword-based search. Text mining is an emerging research field that allows users to extract useful information from text documents; it draws on retrieval, classification, clustering, and machine learning techniques. In this work we propose and implement text classification (naïve Bayes) and text clustering (K-means) algorithms trained on dbGaP study text to identify heart, lung, and blood studies. The performance of the classifiers is compared against dbGaP keyword search results. The results indicate that text classifiers complement the dbGaP document retrieval system well.

Keywords: bioinformatics, data mining, text classification, genotype and phenotype database.

Introduction

1.1 dbGaP

The National Library of Medicine (NLM), part of the National Institutes of Health (NIH), introduced dbGaP as a new database designed to store and distribute data from genome-wide association (GWA) studies. GWA studies uncover associations between specific genes and observable traits, such as weight and blood pressure, or the presence or absence of a disease or condition (phenotype information). Linking phenotype and genotype data provides information about genes that may be involved in a disease process or c......

...... subsequent pattern extraction draws on natural language text rather than on structured databases.
Text mining, data mining, and machine learning algorithms are in high demand in bioinformatics. Text mining techniques applied to bioinformatics chiefly involve methods such as:

- Classification: Text documents are organized into pre-labeled classes. The learning scheme is trained on a set of training documents, and its effectiveness is measured on a separate set of test documents. Because the class labels are known in advance, this is called supervised learning. Common algorithms include decision tree learning, naive Bayes classification, nearest neighbor, and neural networks.
- Clustering: This is an unsupervised learning method. The text documents are not labeled, and patterns inherent in the text are revealed by grouping similar documents. Clustering can also serve as a preliminary step for other text mining methods.
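To make the supervised case concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing, written with the Python standard library only. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`tokenize`, `train_naive_bayes`, `classify`) are hypothetical, the tokenizer is a plain whitespace split, and the training snippets are made-up stand-ins for dbGaP study descriptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            count = word_counts[label][word]
            # Smoothed log likelihood of the word given the class
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training snippets, not real dbGaP study descriptions.
training = [
    ("coronary artery disease and myocardial infarction risk", "heart"),
    ("atrial fibrillation and heart failure cohort", "heart"),
    ("asthma and chronic obstructive pulmonary disease", "lung"),
    ("pulmonary function and lung fibrosis study", "lung"),
    ("sickle cell anemia and hemoglobin levels", "blood"),
    ("platelet count and blood clotting disorders", "blood"),
]
model = train_naive_bayes(training)
predicted = classify("genetic variants in pulmonary fibrosis and asthma", model)
print(predicted)
```

In a realistic setting the labeled documents would be split into training and test sets, as described above, and the classifier would be scored only on the held-out test documents.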
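The unsupervised case can be sketched in a similar way. The block below is a small K-means example over term-frequency vectors, again using only the Python standard library. It is a sketch under assumptions rather than the method used in this work: the helper names are hypothetical, the documents are made-up snippets, and the centroids are seeded with the first k documents purely to keep the example deterministic (practical implementations usually use random or k-means++ initialization).

```python
def term_frequencies(text, vocabulary):
    """Represent a document as a vector of raw term counts over the vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100):
    """Assign each vector to one of k clusters; returns a list of cluster indices."""
    # Assumption for this sketch: seed the centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignments


# Made-up, unlabeled snippets, not real dbGaP study descriptions.
docs = [
    "heart failure and coronary artery disease",
    "asthma and lung function decline",
    "coronary artery disease and heart attack risk",
    "lung function in asthma patients",
]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_frequencies(doc, vocabulary) for doc in docs]
labels = k_means(vectors, k=2)
print(labels)
```

No labels are supplied here; documents that share vocabulary end up with the same cluster index, and interpreting a cluster (for example, as "heart" or "lung") is left to the analyst.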