Counting Words That Match Limited Regular Expressions (Piiratud võimsusega regulaaravaldistele sobituvate sõnade loendamine)
Date
2013
Publisher
Tartu Ülikool
Abstract
This bachelor's thesis focuses on developing and implementing an algorithm. The algorithm forms one part of a larger biomarker discovery workflow. The workflow is being developed in the BIIT group at Tartu Ülikool as part of a collaboration project. The input of the algorithm is a large amount of data about different biological samples.
The data about these samples is presented as short words and their corresponding occurrence frequencies, through which significant differences between samples can be identified. It is also known that in some cases a limited regular expression can give much better information about the differences between samples. However, the frequencies corresponding to regular expressions are not known in advance and must be computed from the input words that characterize the samples and from the frequencies of those words.
This problem can be divided into two parts. First, all words that match a given regular expression must be found. To achieve this, we use large bitvectors that are kept in memory at all times. Second, the frequencies of the regular expression must be computed from the frequencies of the words that match it. Speed is achieved here by keeping a sparse matrix in memory at all times. The format of the data structure representing the matrix is chosen so that matrix rows can be summed as quickly as possible, column by column over the samples.
The algorithm resulting from the thesis is implemented in the programming languages Python and C++. The thesis gives the details of both implementations and finally compares their speed with that of a naive solution developed for the same task.
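The first step, finding the matching words with bitvectors, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not define what a limited regular expression is, so the sketch assumes a fixed-length pattern in which each position is a literal character, a character class such as `[CG]`, or the wildcard `.`, and it assumes all words have the same length as the pattern. The names `build_index`, `parse_pattern` and `match_words` are hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Map (position, character) to a bitvector stored as a Python int.

    Bit i is set when words[i] has that character at that position.
    """
    index = defaultdict(int)
    for i, word in enumerate(words):
        for pos, ch in enumerate(word):
            index[(pos, ch)] |= 1 << i
    return index


def parse_pattern(pattern):
    """Split an assumed limited pattern into one entry per position.

    Each entry is a set of allowed characters, or None for the wildcard '.'.
    """
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)  # raises ValueError if unterminated
            positions.append(set(pattern[i + 1:end]))
            i = end + 1
        elif pattern[i] == ".":
            positions.append(None)
            i += 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def match_words(pattern, words, index):
    """Return the indexes of the words that match the pattern."""
    result = (1 << len(words)) - 1  # start with every word as a candidate
    for pos, allowed in enumerate(parse_pattern(pattern)):
        if allowed is None:
            continue  # a wildcard does not restrict the candidates
        mask = 0
        for ch in allowed:
            mask |= index.get((pos, ch), 0)  # OR within one position
        result &= mask  # AND across positions
    return [i for i in range(len(words)) if (result >> i) & 1]


# Hypothetical example data, not taken from the thesis.
words = ["ACGT", "AGGT", "TCGA", "ACGA"]
index = build_index(words)
matches = match_words("A[CG]G.", words, index)
```

Because the index is built once and kept in memory, each query costs only a few bitwise operations per pattern position, independent of how the words are spelled.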
This bachelor's thesis concentrates on developing and implementing an algorithm for a subtask in a biomarker discovery pipeline. The pipeline itself is being developed at the BIIT group in the University of Tartu as part of an industrial collaboration. The input of this algorithm is data about a large number of different biological samples. The data about these samples is represented by short words and their corresponding frequencies, which allow us to find significant differences between samples. It is also known that in some cases a limited regular expression would be a much better representation of these differences. However, the frequencies that correspond to any given regular expression need to be calculated from the words and the frequencies of these words. This problem can be divided into two parts. First, we need to find all of the words that match the given regular expression. This is achieved by using large bitvectors that are kept in memory at all times. The second part concentrates on calculating the frequencies based on the matching words. Speed is achieved here by storing the frequencies in memory as a sparse array in a format that allows fast addition of rows. The resulting algorithm is implemented in both Python and C++. The details of these implementations are given, and finally the speed of both implementations is measured against a naive solution. The bachelor's thesis results in a program that is able to find the frequencies of input regular expressions with sufficient speed.
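The second step, summing the frequency rows of the matching words, can be sketched as follows. The abstract only says that the sparse structure is laid out so that rows can be added quickly; the compressed-row layout used here (three flat lists) is an assumption chosen for illustration, as are the names `build_rows` and `sum_rows` and the use of integer counts as frequencies.

```python
def build_rows(rows):
    """Pack per-word frequencies into a compressed-row layout.

    rows is a list with one dict per word, mapping sample index to frequency.
    Samples where the word does not occur are simply absent from the dict.
    """
    indptr, indices, data = [0], [], []
    for row in rows:
        for sample in sorted(row):
            indices.append(sample)
            data.append(row[sample])
        indptr.append(len(indices))  # marks where the next row starts
    return indptr, indices, data


def sum_rows(packed, row_ids, n_samples):
    """Add the selected rows together, giving one total per sample."""
    indptr, indices, data = packed
    totals = [0] * n_samples
    for r in row_ids:
        # Only the stored (non-zero) entries of each row are visited.
        for k in range(indptr[r], indptr[r + 1]):
            totals[indices[k]] += data[k]
    return totals


# Hypothetical example: four words, three samples.
word_frequencies = [{0: 3, 2: 1}, {1: 4}, {0: 2}, {2: 5}]
packed = build_rows(word_frequencies)
# Suppose words 0, 1 and 3 matched a regular expression.
regex_frequencies = sum_rows(packed, [0, 1, 3], n_samples=3)
```

Storing each row contiguously means the cost of a query is proportional to the number of non-zero entries in the matching rows rather than to the full size of the matrix.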