Abstract: | Biological knowledge, or at least a large part of it, is spread across different databases. Thanks to advances in computing power, we can analyse all this data using data mining, statistical methods and machine learning
techniques. In this work, we will focus on two important databases that can be used to find relations between populations and phenotypes using SNPs (Single Nucleotide Polymorphisms) as features. We will use information from 1000Genome, a database containing the sequenced genomes of more than 1000 humans, and from GWAS, another database that contains the relations between SNPs and traits (e.g., asthma or cancer). Different ways of extracting information will be presented, including machine
learning. After that, performance analysis and optimization techniques will be applied to both computation speed (parallelism) and I/O (data distribution). Finally, a comparative analysis of machine learning algorithms will be presented. |