Readme


Technical Overview

Regular expressions are a powerful tool for describing and manipulating text data. They are supported by a wide variety of programming and scripting languages, by text editors, and now by Oracle Database 10g SQL and PL/SQL. Regular expressions are useful because they let programmers work with text in terms of patterns. They are widely regarded as one of the most expressive means of searching, manipulating, validating, and formatting strings in applications that deal with text data. They are also used in bioinformatics to help identify DNA and protein sequences, and linguists use them to aid research into natural languages.

Native regular expression support in Oracle Database SQL and PL/SQL greatly extends the ability to search for and manipulate text within the database, adding expressive power to queries, data definitions, and string manipulations.

Application Overview

This application uses regular expressions to extract and analyze DNA data from the SGD database. SGD (Saccharomyces Genome Database) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, commonly known as baker's or budding yeast. Given a region, you can query SGD to get the corresponding yeast genome sequence. This sample uses regular expressions to parse the raw HTTP output and stores the DNA sequence in the database. You can then run regular expression queries to identify specific patterns in the stored data.

A "regular expression" is a sequence of characters that describes a set of one or more strings. To find out whether a certain pattern is present in a given record, such as a DNA or protein sequence, we construct a regular expression that represents that pattern. For example, the pattern "GGATGA" represents the DNA sequence "GGATGA" and no other sequence. The regular expression "GAA[ACGT]{4}TTC" represents GAAACGTTTC, GAAAAAATTC, and so on. Here [ACGT]{4} matches any four consecutive characters drawn from A, C, G, and T, in any combination, including four copies of the same character.

These examples show that some regular expression elements match exactly one character (for example, G matches only guanine), while others can match many different characters. Herein lies the power of regular expression searches: with a relatively small number of symbols, one can specify many different patterns in a single search.

The sample application uses the DNASEQ function to connect to the SGD database and retrieve the HTTP stream data. This stream is then parsed with regular expressions to extract only the DNA sequence, eliminating the control characters. The DNA sequence is then checked for each of the enzyme patterns, and the position of the first occurrence of each pattern within the sequence is listed.
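The parse-then-search flow described above can be sketched in a single query. This is only an illustration, not the code in dnaseq.sql or dna_analysis.sql: the names http_data, http_chunk, cleaned, and dna_seq are made up for the example, and the input string is an invented fragment rather than real SGD output. It uses the Oracle Database 10g functions REGEXP_REPLACE and REGEXP_INSTR.

```sql
-- Sketch only: http_chunk is a made-up stand-in for a fragment of the raw HTTP data
WITH http_data AS (
  SELECT 'TTGAA' || CHR(13) || CHR(10) || 'ACGT TTCGG' AS http_chunk
  FROM   dual
),
cleaned AS (
  -- Remove every character that is not one of the four bases
  SELECT REGEXP_REPLACE(http_chunk, '[^ACGT]', '') AS dna_seq
  FROM   http_data
)
SELECT dna_seq,
       -- Position of the first Asp700I site, or 0 if there is none
       REGEXP_INSTR(dna_seq, 'GAA[ACGT]{4}TTC') AS first_pos
FROM   cleaned;
```

With this input the cleaned sequence is TTGAAACGTTTCGG and first_pos is 3. A real HTTP response contains markup whose letters would survive such a naive filter, so the actual parsing has to be more selective than this sketch.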

Terminology

Term             Definition
<SAMPLE_HOME>    The directory where the sample is extracted

Configuring the Application

1. Unzip the downloaded RegExpDNASample.zip and extract its contents into the <SAMPLE_HOME> directory. This creates the RegExpDNASample folder with all the files and folders.
2. Open a command prompt and move to the <SAMPLE_HOME>/REGEXPDNASample/src folder by executing the following command:
   cd <SAMPLE_HOME>/REGEXPDNASample/src
3. Open a SQL prompt, connect as SCOTT/TIGER, and run the config.sql script from the <SAMPLE_HOME>/REGEXPDNASample/src folder. This creates the necessary database objects (table and function) for this application. Example:
   SQL> @config.sql

Running the Application

1. From the SQL prompt, run the dna_analysis.sql file by issuing the following command:
   SQL> @dna_analysis.sql
2. Enter the value for 'region' (see the note below for sample regions).

This PL/SQL block executes the DNASEQ function, which connects to the http://www.yeastgenome.org website and extracts the DNA sequence. The sequence is then stored in the DNA_DB table. The block also searches for certain enzyme patterns and prints the position of the first occurrence of each within the extracted DNA sequence.

Note: You may input any of the following regions for analysis: YMR317W, YMR010W, YBL016W, YBR077C, YAL004W

The following are some of the enzyme names used in the analysis, with their recognition patterns:

Enzyme Name   Recognition Pattern   Equivalent Oracle Regular Expression Pattern
EcoRI         GAATTC                GAATTC
BamHI         GGATCC                GGATCC
HindII        GTYRAC                GT[CT]{1}[GA]{1}AC
Ama87I        CYCGRG                C[CT]{1}CG[GA]{1}G
Asp700I       GAANNNNTTC            GAA[ACGT]{4}TTC

Sample Application Files

This section lists the sample application files, with their directory locations and a description of what each one does in the application.

Directory             File                 Description
RegExpDNASample\doc   readme.html          This file.
RegExpDNASample\src   config.sql           SQL file used to configure the sample. It creates the necessary table and function.
RegExpDNASample\src   dnaseq.sql           Creates the DNASEQ function.
RegExpDNASample\src   dna_analysis.sql     PL/SQL code that executes the DNASEQ function and runs the regular expression search on the retrieved sequence.
RegExpDNASample\src   search_localdb.sql   SQL script that searches for patterns in the locally stored database.
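The enzyme expressions listed under Running the Application can be tried without network access. The query below is a sketch of that kind of pattern search, not the contents of search_localdb.sql or dna_analysis.sql. The enzymes and sample_seq subqueries, their column names, and the sequence literal are all invented for the example. A query against the real DNA_DB table would need that table's actual column names, which are not given here.

```sql
-- Sketch only: enzymes and sample_seq are inline stand-ins, not objects created by config.sql
WITH enzymes AS (
  SELECT 'EcoRI' AS enzyme_name, 'GAATTC' AS pattern FROM dual UNION ALL
  SELECT 'BamHI',   'GGATCC'             FROM dual UNION ALL
  SELECT 'HindII',  'GT[CT]{1}[GA]{1}AC' FROM dual UNION ALL
  SELECT 'Ama87I',  'C[CT]{1}CG[GA]{1}G' FROM dual UNION ALL
  SELECT 'Asp700I', 'GAA[ACGT]{4}TTC'    FROM dual
),
sample_seq AS (
  -- Made-up sequence for illustration
  SELECT 'CCGAATTCAAGTCGACTT' AS dna_seq FROM dual
)
SELECT e.enzyme_name,
       -- First occurrence of the pattern, or 0 when the enzyme site is absent
       REGEXP_INSTR(s.dna_seq, e.pattern) AS first_pos
FROM   enzymes e CROSS JOIN sample_seq s
ORDER  BY e.enzyme_name;
```

For this made-up sequence, EcoRI is reported at position 3 and HindII at position 11. The other three enzymes return 0, which is how REGEXP_INSTR signals that there is no match.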

Troubleshooting

You may encounter the error "ORA-29273: HTTP request failed" while running the dna_analysis.sql file if you are behind a firewall. To solve this problem, open the dnaseq.sql file, search for UTL_HTTP.SET_PROXY, uncomment the line that contains it, and replace 'www.yourproxy.com' with the address of your proxy server.
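For orientation, a proxy call of this kind might look like the sketch below. UTL_HTTP.SET_PROXY is the standard Oracle package procedure named above. The host is the placeholder from dnaseq.sql and port 80 is an assumption, so substitute the values for your network. The exact line in dnaseq.sql may differ from this.

```sql
BEGIN
  -- Placeholder host with an assumed port: replace both with your proxy settings
  UTL_HTTP.SET_PROXY('www.yourproxy.com:80');
END;
/
```

Running such a block on its own from the SQL prompt can be a quick way to check that the proxy address is accepted before editing dnaseq.sql.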
