Informatica: Multi Domain Master Data Management
Headquarters: Redwood City, California, United States

Main products
Informatica's product portfolio is focused on data integration: Application Information Lifecycle Management, B2B Data Exchange, Cloud Data Integration, Complex Event Processing, Data Masking, Data Quality, Data Replication, Data Virtualization, Enterprise Data Integration, Master Data Management, and Messaging; currently at version 9.5. These components form a toolset for establishing and maintaining enterprise-wide data warehouses, including the key ETL processing. Informatica has a customer base of over 4,500 companies. In 2006, Informatica launched its Informatica Cloud business.

ETL: Extract, transform, load
ETL is a process in database usage, and especially in data warehousing, that involves:

Extracting data from outside sources

Transforming it to fit operational needs (which can include quality levels)

Loading it into the end target (database or data warehouse)

Extract
The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly sets the stage for how subsequent processes will go.

Figure: ETL architecture pattern

Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization or format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen-scraping. Streaming the extracted data and loading it on the fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format that is appropriate for transformation processing.

An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
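The parsing check described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the column patterns, the `extract` function, and the sample feed are assumptions, and the column names are borrowed from the roll_no/age/salary example used in the Transform section.

```python
import csv
import io
import re

# Hypothetical expected structure: roll_no and age are digits,
# salary is a decimal number or empty (a missing salary is still well-formed).
ROW_PATTERN = {
    "roll_no": re.compile(r"^\d+$"),
    "age": re.compile(r"^\d{1,3}$"),
    "salary": re.compile(r"^\d+(\.\d+)?$|^$"),
}

def extract(source_text):
    """Parse CSV text and split rows into (accepted, rejected) lists."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(source_text)):
        ok = all(
            row.get(col) is not None and pattern.match(row[col])
            for col, pattern in ROW_PATTERN.items()
        )
        (accepted if ok else rejected).append(row)
    return accepted, rejected

# Sample feed: the second row has a malformed roll_no and is rejected.
SAMPLE = "roll_no,age,salary\n1,34,52000.50\nx2,29,41000\n3,41,\n"
accepted, rejected = extract(SAMPLE)
```

Here rejection is per row (partial rejection); rejecting the whole feed on the first bad row would be the "entirely" variant.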

Transform
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:

Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all records where salary is not present (salary = null).

Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female).

Encoding free-form values (e.g., mapping "Male" to "1").

Deriving a new calculated value (e.g., sale_amount = qty * unit_price).

Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data.

Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region).

Generating surrogate-key values.

Transposing or pivoting (turning multiple columns into multiple rows or vice versa).

Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns).

Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table).

Looking up and validating the relevant data from tables or referential files for slowly changing dimensions.

Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial, or no rejection of the data, and thus none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.

Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating the extracted data is frequently done on a daily, weekly, or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a historicized form, for example, hourly. To understand this, consider a DW that is required to maintain sales records of the last year. The DW will overwrite any data that is older than a year with newer data. However, the entry of data for any one-year window will be made in a historicized manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields). These constraints also contribute to the overall data quality performance of the ETL process.

For example, a financial institution might have information on a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse.

Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use.

An example of this would be an Expense and Cost Recovery System (ECRS) such as used by accountancies, consultancies and lawyers. The data usually ends up in the time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel dept.) or equipment usage reports to Facilities Management.
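To make the Transform and Load stages concrete, here is a small sketch that translates coded values, derives sale_amount = qty * unit_price, and then loads into a table whose schema constraints reject violations. The table name `sales`, the `order_id` column, and the sample rows are assumptions for illustration; SQLite stands in for the warehouse only because it ships with Python.

```python
import sqlite3

GENDER_CODES = {"1": "M", "2": "F"}  # source code -> warehouse code, as in the text

def transform(rows):
    """Apply code translation and a derived value; return (clean, rejects)."""
    clean, rejects = [], []
    for row in rows:
        gender = GENDER_CODES.get(row["gender"])
        if gender is None:  # unknown code: treat as an exception, do not load
            rejects.append(row)
            continue
        clean.append({
            "order_id": row["order_id"],
            "gender": gender,
            "sale_amount": row["qty"] * row["unit_price"],  # derived value
        })
    return clean, rejects

def load(conn, rows):
    """Insert rows one by one; schema constraints reject violations at load time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "order_id INTEGER PRIMARY KEY, gender TEXT NOT NULL, sale_amount REAL NOT NULL)"
    )
    loaded, failed = 0, []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales VALUES (:order_id, :gender, :sale_amount)", row
            )
            loaded += 1
        except sqlite3.IntegrityError:  # e.g. a uniqueness violation
            failed.append(row)
    conn.commit()
    return loaded, failed

source_rows = [
    {"order_id": 1, "gender": "1", "qty": 2, "unit_price": 9.5},
    {"order_id": 2, "gender": "2", "qty": 1, "unit_price": 20.0},
    {"order_id": 2, "gender": "1", "qty": 3, "unit_price": 1.0},  # duplicate key
    {"order_id": 3, "gender": "9", "qty": 1, "unit_price": 5.0},  # unknown code
]
conn = sqlite3.connect(":memory:")
clean, rejects = transform(source_rows)
loaded, failed = load(conn, clean)
```

Row-by-row inserts are used here so that each constraint violation can be captured individually; a production load would more likely use a bulk path, as discussed under Performance.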

Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

Cycle initiation
Build reference data
Extract (from sources)
Validate
Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
Stage (load into staging tables, if used)
Audit reports (for example, on compliance with business rules; in case of failure, these also help to diagnose and repair)
Publish (to target tables)
Archive
Clean up

Challenges
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that will need to be managed by transform rule specifications. This will lead to an amendment of the validation rules explicitly and implicitly implemented in the ETL process.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous environment.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple micro-batches per day to integration with message queues or real-time change-data capture for continuous transformation and update.
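As a rough illustration of the data profiling mentioned above, the sketch below summarizes each column of a sample so that unexpected codes, nulls, and out-of-range values surface before the validation rules are fixed. The `profile` function and the sample rows are hypothetical.

```python
from collections import Counter

def profile(rows, columns):
    """Return per-column null count, distinct count, min/max, and most common value."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "top": Counter(present).most_common(1),
        }
    return report

rows = [
    {"gender": "1", "age": 34},
    {"gender": "2", "age": 29},
    {"gender": "9", "age": None},  # a code the designers may not have anticipated
    {"gender": "1", "age": 212},   # an out-of-range value
]
report = profile(rows, ["gender", "age"])
```

A report like this would show three distinct gender codes where the translation rule expects two, which is the kind of finding that leads to amended validation rules.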

Performance
ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and lots of memory. The fastest ETL record is currently held by Syncsort[1], Vertica and HP at 5.4 TB in under an hour, which is more than twice as fast as the earlier record held by Microsoft and Unisys.

In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ:

The Direct Path Extract method or bulk unload whenever possible (instead of querying the database), to reduce the load on the source system while getting a high-speed extract

Most of the transformation processing outside of the database

Bulk load operations whenever possible

Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Some common methods used to increase performance are:

Partition tables (and indices). Try to keep partitions similar in size (watch for null values, which can skew the partitioning).

Do all validation in the ETL layer before the load. Disable integrity checking (disable constraint ...) in the target database tables during the load.

Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.

Generate IDs in the ETL layer (not in the database).

Drop the indices (on a table or partition) before the load - and recreate them after the load (SQL: drop index ...; create index ...).

Use parallel bulk load when possible - it works well when the table is partitioned or there are no indices. Note: attempting to do parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indices.

If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately. You often can do bulk load for inserts, but updates and deletes commonly go through an API (using SQL).
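Deciding in the ETL layer which rows are inserts, updates, or deletes can be as simple as comparing keyed snapshots. The following is a sketch under the assumption that both the incoming extract and the current target fit in memory as dictionaries keyed by business key; `classify` and the sample data are hypothetical.

```python
def classify(source, target):
    """Compare keyed snapshots and return (inserts, updates, deletes).

    Both arguments map a business key to a row tuple.
    """
    inserts = {k: v for k, v in source.items() if k not in target}
    updates = {k: v for k, v in source.items() if k in target and target[k] != v}
    deletes = [k for k in target if k not in source]
    return inserts, updates, deletes

target = {1: ("Ann", "NY"), 2: ("Bob", "LA"), 3: ("Cy", "SF")}
source = {1: ("Ann", "NY"), 2: ("Bob", "SD"), 4: ("Di", "TX")}
inserts, updates, deletes = classify(source, target)
```

The three result sets can then be applied separately: the inserts through a bulk load, the updates and deletes through SQL statements.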

Whether to do certain operations in the database or outside may involve a trade-off. For example, removing duplicates using distinct may be slow in the database; thus, it makes sense to do it outside. On the other hand, if using distinct will significantly (x100) decrease the number of rows to be extracted, then it makes sense to remove duplicates as early as possible, in the database, before unloading data.

A common source of problems in ETL is a large number of dependencies among ETL jobs. For example, job "B" cannot start while job "A" is not finished. You can usually achieve better performance by visualizing all processes on a graph, and trying to reduce the graph by making maximum use of parallelism and making "chains" of consecutive processing as short as possible. Again, partitioning of big tables and of their indices can really help.

Another common issue occurs when the data is spread between several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers:

Sources
Central ETL layer
Targets

This allows processing to take maximum advantage of parallel processing. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first and then replicating into the second).

Of course, sometimes processing must take place sequentially. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for the main "fact" tables.
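One way to reason about job dependencies is to group jobs into stages, where every job in a stage can run in parallel and the number of stages equals the longest chain. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later); the job names are hypothetical.

```python
from graphlib import TopologicalSorter

def parallel_stages(dependencies):
    """Group jobs into stages; jobs within one stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages

# Hypothetical jobs: each key lists the jobs that must finish before it starts.
jobs = {
    "load_dim_customer": {"extract_customers"},
    "load_dim_store": {"extract_stores"},
    "load_fact_sales": {"extract_sales", "load_dim_customer", "load_dim_store"},
}
stages = parallel_stages(jobs)
```

In this example the dimension loads run side by side and the fact load waits for both, which matches the point that reference data must arrive before fact rows can be validated.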

Parallel processing

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism:

Data: Splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.

Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate combined in a single job. An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.
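Of the three types, data parallelism is the simplest to sketch: split the input into chunks, transform the chunks concurrently, and reassemble the results in order. The functions and records below are hypothetical, and threads are used only for brevity; CPU-bound transformations in Python would typically use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, parts):
    """Split records into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(records) // parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]

def transform_chunk(chunk):
    """Stand-in transformation: derive sale_amount for every record."""
    return [{**r, "sale_amount": r["qty"] * r["unit_price"]} for r in chunk]

def run_data_parallel(records, workers=4):
    """Transform chunks concurrently and return the rows in their original order."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform_chunk, chunks))  # map preserves chunk order
    return [row for chunk in results for row in chunk]

records = [{"qty": n, "unit_price": 2.0} for n in range(10)]
output = run_data_parallel(records)
```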

Rerunnability, recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece. Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
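A minimal sketch of the run_id and checkpoint idea follows. It assumes a JSON file as the checkpoint store and named step functions; `run_pipeline`, the step names, and the simulated failure are all hypothetical.

```python
import json
import os
import tempfile
import uuid

def run_pipeline(steps, checkpoint_path, run_id=None):
    """Run named steps in order, skipping those already recorded in the checkpoint."""
    state = {"run_id": run_id or uuid.uuid4().hex, "completed": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)  # resume the earlier run, keeping its run_id
    executed = []
    for name, func in steps:
        if name in state["completed"]:
            continue  # finished in an earlier attempt of the same run
        func(state["run_id"])
        state["completed"].append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)  # persist the checkpoint before moving on
        executed.append(name)
    return state["run_id"], executed

attempts = {"count": 0}

def flaky_transform(run_id):
    """Fails on the first call to simulate a mid-run crash."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure")

steps = [
    ("extract", lambda run_id: None),
    ("transform", flaky_transform),
    ("load", lambda run_id: None),
]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_pipeline(steps, path)  # fails in "transform"
except RuntimeError:
    pass
run_id, executed = run_pipeline(steps, path)  # rerun resumes after "extract"
```

Each step receives the run_id so that it can tag the rows it writes, which is what makes rolling back a failed piece practical.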

Virtual ETL

As of 2010, data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. So-called Virtual ETL operates with an abstracted representation of the objects or entities gathered from a variety of relational, semi-structured, and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities' representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection, which contains representations of the entities or objects gathered from the data sources for ETL processing, is called a metadata repository, and it can reside in memory[2] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.

ETL Alternatives

Data integration by design methods provide an alternative to ETL type data integrations. ETL methods are not required as the data in the source data system is integrated in place by recasting the master data of the database. The data integration by design method achieves data integration in days and weeks as opposed to the months and years normally required for ETL type data integrations.

Best practices

Four-layered approach for ETL architecture design
    Functional layer: Core functional ETL processing (extract, transform, and load).
    Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
    Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
    Utility layer: Common components supporting all other layers.

Use file-based ETL processing where possible
    Storage costs relatively little.
    Intermediate files serve multiple purposes:
        Used for testing and debugging
        Used for restart and recover processing
        Used to calculate control statistics
    Helps to reduce dependencies - enables modular programming.
    Allows flexibility for job execution and scheduling.
    Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.

Use data-driven methods and minimize custom ETL coding
    Parameter-driven jobs, functions, and job-control
    Code definitions and mapping in database
    Consideration for data-driven tables to support more complex code mappings and business-rule application.

Qualities of a good ETL architecture design
    Performance
    Scalable
    Migratable
    Recoverable (run_id, ...)
    Operable (completion-codes for phases, re-running from checkpoints, etc.)
    Auditable (in two dimensions: business requirements and technical troubleshooting)

Handling of non-desirable values (NULL values, erroneous values, etc.)
    See: Dealing With Nulls In The Dimensional Model (Kimball University)
    NULL DIMENSIONAL values
    NULL FACT values
    NULL PRIMARY and/or FOREIGN KEY values
    Erroneous or undesirable values

Dealing with keys

Keys are some of the most important objects in all relational databases, as they tie everything together. A primary key is a column which is the identifier for a given entity, whereas a foreign key is a column in another table which refers to a primary key. These keys can also be made up of several columns, in which case they are composite keys. In many cases the primary key is an auto-generated integer which has no meaning for the business entity being represented, but solely exists for the purpose of the relational database - commonly referred to as a surrogate key.

As there will usually be more than one data source being loaded into the warehouse, the keys are an important concern to be addressed. Your customers might be represented in several data sources: in one their SSN (Social Security Number) might be the primary key, in another their phone number, and in a third a surrogate key. All of the customers' information needs to be consolidated into one dimension table.

A recommended way to deal with the concern is to add a warehouse surrogate key, which will be used as a foreign key from the fact table.[8]

Usually updates will occur to a dimension's source data, which obviously must be reflected in the data warehouse. If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports. That is done by creating a lookup table which contains the warehouse surrogate key and the originating key.[9] This way the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.

The lookup table is used in different ways depending on the nature of the source data. There are five types to consider,[10] three of which are included here:

Type 1: The dimension row is simply updated to match the current state of the source system. The warehouse does not capture history. The lookup table is used to identify which dimension row to update or overwrite.

Type 2: A new dimension row is added with the new state of the source system. A new surrogate key is assigned. The source key is no longer unique in the lookup table.

Fully logged: A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active and to record the time of deactivation.

Tools
Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.

By using an established ETL framework, one may increase one's chances of ending up with better connectivity and scalability. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities.

A common use case for ETL tools is converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds or files and import them into a database with as little code as possible. ETL tools are typically used by a broad range of professionals, from students in computer science looking to quickly import large data sets to database architects in charge of company account management, and they have become a convenient means of getting good performance. In most cases ETL tools contain a GUI that helps users conveniently transform data, as opposed to writing large programs to parse files and modify data types.
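To tie the CSV use case to the key handling described under "Dealing with keys", the sketch below loads CSV feeds from two source systems into one dimension table, assigning warehouse surrogate keys through a lookup table and applying Type 1 updates. The table and column names (`dim_customer`, `key_lookup`, `customer_sk`) and the sample feeds are assumptions, and SQLite is used only because it is bundled with Python.

```python
import csv
import io
import sqlite3

def load_customers(conn, source_system, csv_text):
    """Load a CSV feed, mapping each source key to a warehouse surrogate key."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS key_lookup (
            source_system TEXT NOT NULL,
            source_key TEXT NOT NULL,
            customer_sk INTEGER NOT NULL REFERENCES dim_customer(customer_sk),
            PRIMARY KEY (source_system, source_key)
        );
    """)
    for row in csv.DictReader(io.StringIO(csv_text)):
        found = conn.execute(
            "SELECT customer_sk FROM key_lookup"
            " WHERE source_system = ? AND source_key = ?",
            (source_system, row["key"]),
        ).fetchone()
        if found:  # Type 1: overwrite the dimension row, keeping no history
            conn.execute("UPDATE dim_customer SET name = ? WHERE customer_sk = ?",
                         (row["name"], found[0]))
        else:      # new source key: create a dimension row and register the mapping
            cur = conn.execute("INSERT INTO dim_customer (name) VALUES (?)",
                               (row["name"],))
            conn.execute("INSERT INTO key_lookup VALUES (?, ?, ?)",
                         (source_system, row["key"], cur.lastrowid))
    conn.commit()

conn = sqlite3.connect(":memory:")
load_customers(conn, "crm", "key,name\n123-45-6789,Ann Lee\n")
load_customers(conn, "billing", "key,name\n555-0100,Bob Ray\n")
load_customers(conn, "crm", "key,name\n123-45-6789,Ann Lee-Smith\n")  # Type 1 update
```

This sketch does not attempt to recognize the same customer across different source systems; that matching step is a separate consolidation problem.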

Data extraction
Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.[1]

Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. Extracting data from these unstructured sources has grown into a considerable technical challenge: whereas historically data extraction has had to deal with changes in physical hardware formats, the majority of current data extraction deals with extracting data from these unstructured data sources and from different software formats. This growing process of data extraction from the web is referred to as Web scraping.

The act of adding structure to unstructured data takes a number of forms:

Using text pattern matching such as regular expressions to identify small or large-scale structure e.g. records in a report and their associated data from headers and footers;

Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these would differ from language to language), e.g. Education might be found under Education/Qualification/Courses;

Using text analytics to attempt to understand the text and link it to other information.
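The first approach, text pattern matching, can be sketched with regular expressions that recognize header lines and detail lines in a report and attach header data to each detail record. The report layout, the patterns, and `parse_report` are invented for illustration.

```python
import re

REPORT = """\
ACME DAILY SALES REPORT        PAGE 1
STORE: 0042  DATE: 2012-03-01
ITEM       QTY   PRICE
widget       3    9.50
gadget      12    1.25
STORE: 0043  DATE: 2012-03-01
ITEM       QTY   PRICE
widget       1    9.50
"""

HEADER = re.compile(r"^STORE:\s*(\d+)\s+DATE:\s*(\d{4}-\d{2}-\d{2})$")
DETAIL = re.compile(r"^(\w+)\s+(\d+)\s+(\d+\.\d{2})$")

def parse_report(text):
    """Attach the most recent header's store and date to each detail line."""
    records, store, date = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if m := HEADER.match(line):
            store, date = m.groups()
        elif (m := DETAIL.match(line)) and store is not None:
            item, qty, price = m.groups()
            records.append({"store": store, "date": date, "item": item,
                            "qty": int(qty), "price": float(price)})
        # Title and column-heading lines match neither pattern and are skipped.
    return records

records = parse_report(REPORT)
```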

Although the expression "data about data" is often used, the correct description would be "data about the containers of data".

Descriptive metadata
Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:

Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of data
Location on a computer network where the data was created
Standards used

For example, a digital image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document. Metadata is data. As such, metadata can be stored and managed in a database, often called a Metadata registry or Metadata repository. However, without context and a point of reference, it can be impossible to identify metadata just by looking at it. For example: by itself, a database containing several numbers, all 13 digits long could be the results of calculations or a list of numbers to plug into an equation - without any other context, the numbers themselves can be perceived as the data. But if given the context that this database is a log of a book collection, those 13-digit numbers may now be ISBNs - information that refers to the book, but is not itself the information within the book. The term "metadata" was coined in 1968 by Philip Bagley, in his book "Extension of programming language concepts" where it is clear that he uses the term in the ISO 11179 "traditional" sense, which is "structural metadata" i.e. "data about the containers of data"; rather than the alternate sense "content about individual instances of data content" or metacontent, the type of data usually found in library catalogues. Since then the fields of information management, information science, information technology, librarianship and GIS have widely adopted the term. In these fields the word metadata is defined as "data about data". While this is the generally accepted definition, various disciplines have adopted their own more specific explanation and uses of the term.

Data Virtualization
Data Virtualization has emerged as the new software technology to complete the virtualization stack in the enterprise. Metadata is used in Data Virtualization servers, which are enterprise infrastructure components alongside database and application servers. Metadata in these servers is saved in a persistent repository and describes business objects in various enterprise systems and applications. Structural metadata commonality is also important to support data virtualization and data federation.

Metadata and data warehousing
A data warehouse (DW) is a repository of an organization's electronically stored data. Data warehouses are designed to manage and store the data, whereas Business Intelligence (BI) focuses on the usage of data to facilitate reporting and analysis.

The purpose of a data warehouse is to house standardized, structured, consistent, integrated, correct, cleansed, and timely data, extracted from various operational systems in an organization. The extracted data is integrated in the data warehouse environment in order to provide an enterprise-wide perspective, one version of the truth. Data is structured in a way to specifically address the reporting and analytic requirements.

The design of structural metadata commonality using a data modeling method such as entity-relationship model diagramming is very important in any data warehouse development effort.

An essential component of a data warehouse/business intelligence system is the metadata and the tools to manage and retrieve metadata. Ralph Kimball describes metadata as the DNA of the data warehouse, as metadata defines the elements of the data warehouse and how they work together.

Kimball et al. refer to three main categories of metadata: technical metadata, business metadata, and process metadata. Technical metadata is primarily definitional, while business metadata and process metadata are primarily descriptive. Keep in mind that the categories sometimes overlap.

Technical metadata defines the objects and processes in a DW/BI system, as seen from a technical point of view. The technical metadata includes the system metadata, which defines the data structures such as tables, fields, data types, indexes, and partitions in the relational engine, and databases, dimensions, measures, and data mining models. Technical metadata defines the data model and the way it is displayed for the users, with the reports, schedules, distribution lists, and user security rights.

Business metadata is content from the data warehouse described in more user-friendly terms. The business metadata tells you what data you have, where it comes from, what it means, and what its relationship is to other data in the data warehouse. Business metadata may also serve as documentation for the DW/BI system. Users who browse the data warehouse are primarily viewing the business metadata.

Process metadata is used to describe the results of various operations in the data warehouse. Within the ETL process, all key data from tasks are logged on execution. This includes start time, end time, CPU seconds used, disk reads, disk writes, and rows processed. When troubleshooting the ETL or query process, this sort of data becomes valuable. Process metadata is the fact measurement when building and using a DW/BI system. Some organizations make a living out of collecting and selling this sort of data to companies - in that case the process metadata becomes the business metadata for the fact and dimension tables. Process metadata is of interest to business people, who can use the data to identify the users of their products, which products they are using, and what level of service they are receiving.
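Capturing process metadata for a task might look like the sketch below, which records start time, end time, CPU seconds, and rows processed. The `logged_task` helper and the in-memory `process_log` list are assumptions; disk reads and writes are omitted because the standard library offers no portable way to measure them.

```python
import time
from contextlib import contextmanager

process_log = []  # in practice this would be a table in the warehouse

@contextmanager
def logged_task(name):
    """Record start time, end time, CPU seconds, and rows processed for one task."""
    entry = {"task": name, "rows_processed": 0, "status": "running"}
    entry["start_time"] = time.time()
    cpu_start = time.process_time()
    try:
        yield entry  # the task body updates entry["rows_processed"]
        entry["status"] = "succeeded"
    except Exception:
        entry["status"] = "failed"
        raise
    finally:
        entry["end_time"] = time.time()
        entry["cpu_seconds"] = time.process_time() - cpu_start
        process_log.append(entry)

with logged_task("load_sales") as task:
    for _ in range(1000):  # stand-in for loading 1000 rows
        task["rows_processed"] += 1
```

Because the entry is appended in the `finally` clause, failed tasks are logged as well, which is when this data tends to matter most.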

Information schema
In relational databases, the information schema is an ANSI standard set of read-only views which provide information about all of the tables, views, columns, and procedures in a database. It can be used as a source of the information which some databases make available through non-standard commands, such as the SHOW command of MySQL, the DESCRIBE command of Oracle, and the \d command of PostgreSQL.

    => select count(table_name) from information_schema.tables;
     count
    -------
        99
    (1 row)

    => select column_name, data_type, column_default, is_nullable
       from information_schema.columns where table_name='alpha';
     column_name | data_type | column_default | is_nullable
    -------------+-----------+----------------+-------------
     foo         | integer   |                | YES
     bar         | character |                | YES
    (2 rows)

    => select * from information_schema.information_schema_catalog_name;
     catalog_name
    --------------
     johnd
    (1 row)
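The same kind of catalog lookup can be done from application code. The sketch below uses SQLite, which does not implement information_schema; its closest equivalents are the `sqlite_master` table and `PRAGMA table_info`, so this is an analogue of the queries above rather than the standard views themselves. The table `alpha` mirrors the example, and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alpha (foo INTEGER, bar CHARACTER)")

def list_tables(conn):
    """Rough analogue of querying information_schema.tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [name for (name,) in rows]

def describe(conn, table):
    """Rough analogue of querying information_schema.columns for one table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated here, so `table` must be trusted input.
    rows = conn.execute(f"PRAGMA table_info({table})")
    return [
        {"column_name": name, "data_type": ctype, "column_default": default,
         "is_nullable": "NO" if notnull else "YES"}
        for _cid, name, ctype, notnull, default, _pk in rows
    ]

tables = list_tables(conn)
columns = describe(conn, "alpha")
```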

Database schema
A database schema (pronounced skee-ma, /ˈskiːmə/) of a database system is its structure described in a formal language supported by the database management system (DBMS) and refers to the organization of data to create a blueprint of how a database will be constructed (divided into database tables). The formal definition of database schema is a set of formulas (sentences) called integrity constraints imposed on a database. These integrity constraints ensure compatibility between parts of the schema. All constraints are expressible in the same language. A database can be considered a structure in realization of the database language.[1] The states of a created conceptual schema are transformed into an explicit mapping, the database schema. This describes how real-world entities are modeled in the database.

"A database schema specifies, based on the database administrator's knowledge of possible applications, the facts that can enter the database, or those of interest to the possible end-users."[2] The notion of a database schema plays the same role as the notion of theory in predicate calculus. A model of this theory closely corresponds to a database, which can be seen at any instant of time as a mathematical object. Thus a schema can contain formulas representing integrity constraints specifically for an application and the constraints specifically for a type of database, all expressed in the same database language.[1]

In a relational database, the schema defines the tables, fields, relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, materialized views, synonyms, database links, directories, Java, XML schemas, and other elements. Schemas are generally stored in a data dictionary. Although a schema is defined in text database language, the term is often used to refer to a graphical depiction of the database structure. In other words, the schema is the structure of the database that defines the objects in the database.

In an Oracle Database system, the term "schema" has a slightly different connotation. For the interpretation used in an Oracle Database, see schema object.

Levels of database schema

Conceptual schema, a map of concepts and their relationships

Logical schema, a map of entities and their attributes and relations

Physical schema, a particular implementation of a logical schema

Schema object, Oracle database object

Ideal requirements for schema integration

Completeness: All information in the source data should be included in the database schema.[3]

Overlap preservation: Each of the overlapping elements specified in the input mapping is also in a database schema relation.[3]

Extended overlap preservation: Source-specific elements that are associated with a source's overlapping elements are passed through to the database schema.[3]

Normalization (see Database normalization): Independent entities and relationships in the source data should not be grouped together in the same relation in the database schema. In particular, source-specific schema elements should not be grouped with overlapping schema elements, if the grouping co-locates independent entities or relationships.[3]

Minimality: If any elements of the database schema are dropped then the database schema is not ideal.[3]

These requirements influence the detailed structure of schemas that are produced. Certain applications will not require that all of these conditions be met, but these five requirements describe the ideal case.

Example of two schema integrations

Suppose we want a mediated (database) schema to integrate two travel databases, Go-travel and Ok-travel. Go-travel has three relations:

    Go-flight(f-num, time, meal)
    Go-price(f-num, date, price)
    Go-airline(airline, phone)

The attribute f-num is the flight number and meal is a boolean. The other attributes are self-explanatory. Ok-travel has just one relation:

    Ok-flight(f-num, date, time, price, nonstop)

Here nonstop is a boolean. The overlapping information in Ok-travel's and Go-travel's schemas could be represented in a mediated schema:

    Flight(f-num, date, time, price)[3]
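One way to picture the mediated Flight relation is as a view over the source relations. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, replaces the hyphens in the relation and attribute names with underscores, and invents the sample rows. The source does not say how the mediated schema is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source relations (names adapted for SQL; sample rows are made up).
    CREATE TABLE go_flight (f_num TEXT, time TEXT, meal INTEGER);
    CREATE TABLE go_price  (f_num TEXT, date TEXT, price REAL);
    CREATE TABLE ok_flight (f_num TEXT, date TEXT, time TEXT, price REAL, nonstop INTEGER);

    INSERT INTO go_flight VALUES ('GO100', '08:15', 1);
    INSERT INTO go_price  VALUES ('GO100', '2012-06-01', 250.0);
    INSERT INTO ok_flight VALUES ('OK200', '2012-06-01', '13:40', 180.0, 0);

    -- Mediated relation Flight(f_num, date, time, price): only the
    -- overlapping attributes of both sources are exposed.
    CREATE VIEW flight AS
        SELECT f.f_num, p.date, f.time, p.price
        FROM go_flight AS f JOIN go_price AS p ON p.f_num = f.f_num
        UNION ALL
        SELECT f_num, date, time, price FROM ok_flight;
""")

flights = conn.execute(
    "SELECT f_num, date, time, price FROM flight ORDER BY f_num"
).fetchall()
print(flights)
```

The source-specific attributes meal and nonstop are left out here, so this sketch covers overlap preservation only, not extended overlap preservation.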

Online transaction processing

Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions.[1] OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application.

Requirements

OLTP is a methodology to provide end users with access to large amounts of data in an intuitive and rapid manner to assist with deductions based on investigative reasoning. Online transaction processing increasingly requires support for transactions that span a network and may include more than one company. For this reason, new online transaction processing software uses client or server processing and brokering software that allows transactions to run on different computer platforms in a network. In large applications, efficient OLTP may depend on sophisticated transaction management software (such as CICS) and/or database optimization tactics to facilitate the processing of large numbers of concurrent updates to an OLTP-oriented database. For even more demanding decentralized database systems, OLTP brokering programs can distribute transaction processing among multiple computers on a network. OLTP is often integrated into service-oriented architecture (SOA) and Web services.

Benefits

Online transaction processing has two key benefits: simplicity and efficiency. Reduced paper trails and faster, more accurate forecasts for revenues and expenses are both examples of how OLTP makes things simpler for businesses.

Disadvantages

As with any information processing system, security and reliability are important considerations. When organizations choose to rely on OLTP, operations can be severely impacted if the transaction system or database is unavailable due to data corruption, systems failure, or network availability issues. Additionally, like many modern online information technology solutions, some systems require offline maintenance, which further affects the cost-benefit analysis.

Online analytical processing

Online analytical processing, or OLAP, is an approach to swiftly answer multi-dimensional analytical (MDA) queries.[1] OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining.[2] Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM),[3] budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture.[4] The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).[5]

OLAP tools enable users to interactively analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.[6] Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. In contrast, drill-down is a technique that allows users to navigate through the details. For instance, users can access the sales by individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the cube and view (dicing) the slices from different viewpoints.

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time.[7] They borrow aspects of navigational databases, hierarchical databases and relational databases.

The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

Each measure can be thought of as having a set of labels, or metadata, associated with it. A dimension is what describes these labels; it provides information about the measure. A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a dimension. Each sale has a Date/Time label that describes more about that sale. Any number of dimensions, such as Store, Cashier, or Customer, can be added to the structure by adding a foreign key column to the fact table. This allows an analyst to view the measures along any combination of the dimensions.
For example:
 Sales Fact Table
+-------------+----------+
| sale_amount | time_id  |
+-------------+----------+            Time Dimension
|     2008.10 |     1234 |---+     +---------+-------------------+
+-------------+----------+   |     | time_id | timestamp         |
                             |     +---------+-------------------+
                             +---->|    1234 | 20080902 12:35:43 |
                                   +---------+-------------------+
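The fact and dimension tables above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module; the table names sales_fact and time_dim, the ISO-style timestamps, and every row except the first are assumptions made for illustration. It performs a roll-up of the sale_amount measure from individual sales to one total per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, timestamp TEXT);
    CREATE TABLE sales_fact (
        sale_amount REAL,
        time_id INTEGER REFERENCES time_dim(time_id)  -- foreign key to the dimension
    );
    INSERT INTO time_dim VALUES (1234, '2008-09-02 12:35:43');
    INSERT INTO time_dim VALUES (1235, '2008-09-02 16:10:00');
    INSERT INTO time_dim VALUES (1236, '2008-09-03 09:00:00');
    INSERT INTO sales_fact VALUES (2008.10, 1234);
    INSERT INTO sales_fact VALUES (100.00, 1235);
    INSERT INTO sales_fact VALUES (50.00, 1236);
""")

# Roll-up: aggregate the measure along the time dimension, one row per day.
daily_totals = conn.execute("""
    SELECT substr(t.timestamp, 1, 10) AS day, SUM(f.sale_amount) AS total
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_id = f.time_id
    GROUP BY day
    ORDER BY day
""").fetchall()

for day, total in daily_totals:
    print(day, round(total, 2))
```

Adding another dimension such as Store would mean one more dimension table and one more foreign key column in sales_fact, after which the same query pattern can group by either dimension or both.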

Database normalization
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships. Independent entities and relationships in the source data should not be grouped together in the same relation in the database schema. In particular, source specific schema elements should not be grouped with overlapping schema elements, if the grouping co-locates independent entities or relationships.
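As a small illustration of "made in just one table and then propagated", the sketch below contrasts a flat table with a normalized pair of tables. It uses Python's sqlite3 module; the customer and orders tables and their rows are hypothetical and not taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: the customer's city is repeated on every order row.
    CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, customer TEXT, city TEXT, item TEXT);
    INSERT INTO orders_flat VALUES (1, 'Ann', 'Oslo', 'pen');
    INSERT INTO orders_flat VALUES (2, 'Ann', 'Oslo', 'ink');

    -- Normalized: the city is stored once and referenced by key.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        item TEXT
    );
    INSERT INTO customer VALUES (1, 'Ann', 'Oslo');
    INSERT INTO orders VALUES (1, 1, 'pen');
    INSERT INTO orders VALUES (2, 1, 'ink');
""")

# A change of city is now a single-row update in one table ...
changed = conn.execute(
    "UPDATE customer SET city = 'Bergen' WHERE customer_id = 1"
).rowcount

# ... and every order sees it through the defined relationship.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM orders AS o JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(changed, rows)
```

In orders_flat the same change would have to touch every order row for that customer, which is where update anomalies come from.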


Denormalisation

In computing, denormalisation is the process of attempting to optimise the read performance of a database by adding redundant data or by grouping data.[1][2] In some cases, denormalisation helps cover up the inefficiencies inherent in relational database software. A relational normalised database imposes a heavy access load over physical storage of data even if it is well tuned for high performance.

A normalised design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow. If many relations are joined, it may be prohibitively slow. There are two strategies for dealing with this.

The preferred method is to keep the logical design normalised, but allow the database management system (DBMS) to store additional redundant information on disk to optimise query response. In this case it is the DBMS software's responsibility to ensure that any redundant copies are kept consistent. This method is often implemented in SQL as indexed views (Microsoft SQL Server) or materialised views (Oracle). A view represents information in a format convenient for querying, and the index ensures that queries against the view are optimised.

The more usual approach is to denormalise the logical data design. With care this can achieve a similar improvement in query response, but at a cost: it is now the database designer's responsibility to ensure that the denormalised database does not become inconsistent. This is done by creating rules in the database called constraints, which specify how the redundant copies of information must be kept synchronised. It is the increase in logical complexity of the database design and the added complexity of the additional constraints that make this approach hazardous. Moreover, constraints introduce a trade-off, speeding up reads (SELECT in SQL) while slowing down writes (INSERT, UPDATE, and DELETE). This means a denormalised database under heavy write load may actually offer worse performance than its functionally equivalent normalised counterpart.

A denormalised data model is not the same as a data model that has not been normalised. Denormalisation should only take place after a satisfactory level of normalisation has been reached and after any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multi-valued dependencies are handled appropriately.
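The second strategy can be sketched with a redundant counter column. The example below is an assumption-laden illustration, not the method of any particular product: it uses Python's sqlite3 module, hypothetical author and book tables, and triggers as the synchronising rules (the text above speaks of constraints; triggers are swapped in because they are one common way to express such rules).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        author_id INTEGER PRIMARY KEY,
        name TEXT,
        book_count INTEGER NOT NULL DEFAULT 0   -- redundant, derived value
    );
    CREATE TABLE book (
        isbn TEXT PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES author(author_id)
    );

    -- Rules that keep the redundant copy synchronised on every write.
    CREATE TRIGGER book_added AFTER INSERT ON book
    BEGIN
        UPDATE author SET book_count = book_count + 1 WHERE author_id = NEW.author_id;
    END;
    CREATE TRIGGER book_removed AFTER DELETE ON book
    BEGIN
        UPDATE author SET book_count = book_count - 1 WHERE author_id = OLD.author_id;
    END;

    INSERT INTO author (author_id, name) VALUES (1, 'A. Writer');
    INSERT INTO book VALUES ('111', 'First',  1);
    INSERT INTO book VALUES ('222', 'Second', 1);
    DELETE FROM book WHERE isbn = '111';
""")

# Read path: no join and no COUNT(*) over the book table.
stored = conn.execute("SELECT book_count FROM author WHERE author_id = 1").fetchone()[0]
# Cross-check against the normalised source of truth.
counted = conn.execute("SELECT COUNT(*) FROM book WHERE author_id = 1").fetchone()[0]
print(stored, counted)
```

Every insert or delete on book now also updates author, which is the write-side cost described above. The sketch does not handle an UPDATE that moves a book to another author; a complete design would need a rule for that case as well.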

Examples of denormalisation techniques include:

Materialised views, which may implement the following:

    Storing the count of the "many" objects in a one-to-many relationship as an attribute of the "one" relation

    Adding attributes to a relation from another relation with which it will be joined

Star schemas, which are also known as fact-dimension models and have been extended to snowflake schemas

Prebuilt summarisation or OLAP cubes

Denormalisation techniques are often used to improve the scalability of Web applications.[3]

SQL

SQL (sometimes referred to as Structured Query Language) is a programming language designed for managing data in relational database management systems (RDBMS). Its scope includes data insert, query, update and delete, schema creation and modification, and data access control. The SQL language is subdivided into several language elements, including:

Clauses, which are constituent components of statements and queries. (In some cases, these are optional.)[11]

Expressions, which can produce either scalar values or tables consisting of columns and rows of data.

Predicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL) (true/false/unknown) or Boolean truth values and which are used to limit the effects of statements and queries, or to change program flow.

Queries, which retrieve the data based on specific criteria. This is the most important element of SQL.

Statements, which may have a persistent effect on schemata and data, or which may control transactions, program flow, connections, sessions, or diagnostics.

SQL statements also include the semicolon (";") statement terminator. Though not required on every platform, it is defined as a standard part of the SQL grammar.

Insignificant whitespace is generally ignored in SQL statements and queries, making it easier to format SQL code for readability.

Queries

The most common operation in SQL is the query, which is performed with the declarative SELECT statement. SELECT retrieves data from one or more tables, or expressions. Standard SELECT statements have no persistent effects on the database. Some non-standard implementations of SELECT can have persistent effects, such as the SELECT INTO syntax that exists in some databases.[12]

Queries allow the user to describe desired data, leaving the database management system (DBMS) responsible for planning, optimizing, and performing the physical operations necessary to produce that result as it chooses.

A query includes a list of columns to be included in the final result immediately following the SELECT keyword. An asterisk ("*") can also be used to specify that the query should return all columns of the queried tables. SELECT is the most complex statement in SQL, with optional keywords and clauses that include:

The FROM clause, which indicates the table(s) from which data is to be retrieved. The FROM clause can include optional JOIN subclauses to specify the rules for joining tables.

The WHERE clause, which includes a comparison predicate that restricts the rows returned by the query. The WHERE clause eliminates all rows from the result set for which the comparison predicate does not evaluate to True.

The GROUP BY clause, which is used to project rows having common values into a smaller set of rows. GROUP BY is often used in conjunction with SQL aggregation functions or to eliminate duplicate rows from a result set. The WHERE clause is applied before the GROUP BY clause.

The HAVING clause, which includes a predicate used to filter rows resulting from the GROUP BY clause. Because it acts on the results of the GROUP BY clause, aggregation functions can be used in the HAVING clause predicate.

The ORDER BY clause, which identifies which columns are used to sort the resulting data, and in which direction they should be sorted (options are ascending or descending). Without an ORDER BY clause, the order of rows returned by an SQL query is undefined.
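The clauses listed above can be seen working together in one query. This is a minimal sketch run through Python's sqlite3 module; the publisher column and all sample rows are assumptions added for illustration, while the Book table name and the 100.00 price threshold follow the document's own examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book (isbn TEXT PRIMARY KEY, title TEXT, publisher TEXT, price REAL);
    INSERT INTO Book VALUES ('1', 'An Introduction to SQL', 'Acme',  40.0);
    INSERT INTO Book VALUES ('2', 'The Joy of SQL',         'Acme', 120.0);
    INSERT INTO Book VALUES ('3', 'Pitfalls of SQL',        'Acme', 150.0);
    INSERT INTO Book VALUES ('4', 'SQL Examples and Guide', 'Bolt', 110.0);
    INSERT INTO Book VALUES ('5', 'SQL Pocket Notes',       'Core',  15.0);
""")

rows = conn.execute("""
    SELECT publisher, COUNT(*) AS titles, MAX(price) AS top_price
    FROM Book                     -- source table
    WHERE price > 100.00          -- row filter, applied before grouping
    GROUP BY publisher            -- one output row per publisher
    HAVING COUNT(*) >= 2          -- group filter, may use aggregates
    ORDER BY publisher ASC        -- explicit ordering of the result
""").fetchall()
print(rows)
```

WHERE first drops the two cheap books, GROUP BY then forms one group per remaining publisher, and HAVING keeps only the groups with at least two expensive titles.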

The following is an example of a SELECT query that returns a list of expensive books. The query retrieves all rows from the Book table in which the price column contains a value greater than 100.00. The result is sorted in ascending order by title. The asterisk (*) in the select list indicates that all columns of the Book table should be included in the result set.

    SELECT *
    FROM Book
    WHERE price > 100.00
    ORDER BY title;

The example below demonstrates a query of multiple tables, grouping, and aggregation, by returning a list of books and the number of authors associated with each book.

    SELECT Book.title, COUNT(*) AS Authors
    FROM Book JOIN Book_author ON Book.isbn = Book_author.isbn
    GROUP BY Book.title;

Example output might resemble the following:

    Title                   Authors
    ----------------------  -------
    SQL Examples and Guide  4
    The Joy of SQL          1
    An Introduction to SQL  2
    Pitfalls of SQL         1

Under the precondition that isbn is the only common column name of the two tables and that a column named title only exists in the Book table, the above query could be rewritten in the following form:

    SELECT title, COUNT(*) AS Authors
    FROM Book NATURAL JOIN Book_author
    GROUP BY title;

However, many vendors either do not support this approach, or require certain column naming conventions in order for natural joins to work effectively.

SQL includes operators and functions for calculating values on stored values. SQL allows the use of expressions in the select list to project data, as in the following example, which returns a list of books that cost more than 100.00 with an additional sales_tax column containing a sales tax figure calculated at 6% of the price.

    SELECT isbn, title, price, price * 0.06 AS sales_tax
    FROM Book
    WHERE price > 100.00
    ORDER BY title;

Subqueries

Queries can be nested so that the results of one query can be used in another query via a relational operator or aggregation function. A nested query is also known as a subquery. While joins and other table operations provide computationally superior (i.e. faster) alternatives in many cases, the use of subqueries introduces a hierarchy in execution which can be useful or necessary. In the following example, a subquery computes the average price with the aggregation function AVG, and the outer query returns the books priced below that average:

    SELECT isbn, title, price
    FROM Book
    WHERE price < (SELECT AVG(price) FROM Book)
    ORDER BY title;

Null and three-valued logic (3VL)

The idea of Null was introduced into SQL to handle missing information in the relational model. The introduction of Null (or Unknown) along with True and False is the foundation of three-valued logic. Null does not have a value (and is not a member of any data domain) but is rather a placeholder or "mark" for missing information. Therefore comparisons with Null can never result in either True or False but always in the third logical result, Unknown.[13]

The rules governing SQL three-valued logic are shown below (p and q represent logical states).[14] The word NULL is also a reserved keyword in SQL, used to identify the Null special marker. Additionally, since SQL operators return Unknown when comparing anything with Null, SQL provides two Null-specific comparison predicates: IS NULL and IS NOT NULL test whether data is or is not Null.[13] Note that SQL returns only results for which the WHERE clause returns a value of True; i.e. it excludes results with values of False and also excludes those whose value is Unknown.

    p AND q        p = True    p = False   p = Unknown
    q = True       True        False       Unknown
    q = False      False       False       False
    q = Unknown    Unknown     False       Unknown

    p OR q         p = True    p = False   p = Unknown
    q = True       True        True        True
    q = False      True        False       Unknown
    q = Unknown    True        Unknown     Unknown

    p = q          p = True    p = False   p = Unknown
    q = True       True        False       Unknown
    q = False      False       True        Unknown
    q = Unknown    Unknown     Unknown     Unknown
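The effect of three-valued logic on a WHERE clause can be observed directly. The sketch below assumes SQLite (through Python's sqlite3 module), where Unknown is represented by NULL and reaches Python as None; the Book rows and the helper function titles are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book (title TEXT, price REAL);
    INSERT INTO Book VALUES ('Priced high', 150.0);
    INSERT INTO Book VALUES ('Priced low',   20.0);
    INSERT INTO Book VALUES ('Not priced',   NULL);
""")

def titles(where):
    """Return the titles of the rows for which the WHERE predicate is True."""
    sql = f"SELECT title FROM Book WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]

expensive = titles("price > 100")             # Unknown for the Null row
not_expensive = titles("NOT (price > 100)")   # NOT Unknown is still Unknown
equals_null = titles("price = NULL")          # never True, even for the Null row
is_null = titles("price IS NULL")             # the Null-specific predicate

# Unknown = Unknown, Unknown AND False, Unknown OR True
unknown = conn.execute("SELECT NULL = NULL, NULL AND 0, NULL OR 1").fetchone()
print(expensive, not_expensive, equals_null, is_null, unknown)
```

The 'Not priced' row appears in neither the expensive nor the not_expensive list, because its predicate is Unknown in both cases and WHERE keeps only rows that evaluate to True.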

Universal quantification is not explicitly supported by SQL, and must be worked out as a negated existential quantification.
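A "for all" condition is therefore usually written with NOT EXISTS. The following sketch, run through Python's sqlite3 module, finds the publishers whose books all cost more than 100.00; the Publisher table, the publisher column, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Publisher (name TEXT PRIMARY KEY);
    CREATE TABLE Book (title TEXT, publisher TEXT, price REAL NOT NULL);
    INSERT INTO Publisher VALUES ('Acme');
    INSERT INTO Publisher VALUES ('Bolt');
    INSERT INTO Book VALUES ('The Joy of SQL',         'Acme', 120.0);
    INSERT INTO Book VALUES ('Pitfalls of SQL',        'Acme', 150.0);
    INSERT INTO Book VALUES ('SQL Examples and Guide', 'Bolt', 110.0);
    INSERT INTO Book VALUES ('An Introduction to SQL', 'Bolt',  40.0);
""")

# "Every book of the publisher costs more than 100" is rewritten as
# "there is no book of the publisher that costs 100 or less".
premium_only = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM Publisher AS p
    WHERE NOT EXISTS (
        SELECT 1 FROM Book AS b
        WHERE b.publisher = p.name AND b.price <= 100.00
    )
    ORDER BY p.name
""")]
print(premium_only)
```

Note that a publisher with no books at all would also be returned, since the negated existential is vacuously satisfied. The price column is declared NOT NULL here so that Unknown comparisons do not complicate the rewrite.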

There is also the "<row value expression> IS DISTINCT FROM <row value expression>" infixed comparison operator which returns TRUE unless both operands are equal or both are NULL. Likewise, IS NOT DISTINCT FROM is defined as "NOT (<row value expression> IS DISTINCT FROM <row value expression>)".

UPDATE modifies a set of existing table rows, for example:

    UPDATE My_table
    SET field1 = 'updated value'
    WHERE field2 = 'N';

DELETE removes existing rows from a table, for example:

    DELETE FROM My_table
    WHERE field2 = 'N';

MERGE is used to combine the data of multiple tables. It combines the INSERT and UPDATE elements. It is defined in the SQL:2003 standard; prior to that, some databases provided similar functionality via different syntax, sometimes called "upsert".
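As an illustration of the "upsert" idea, the sketch below uses SQLite's INSERT ... ON CONFLICT ... DO UPDATE form through Python's sqlite3 module. This is SQLite's own syntax (available in SQLite 3.24 and later), not the standard MERGE statement, and the stock table and its rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER NOT NULL);
    INSERT INTO stock VALUES ('pen', 10);
""")

incoming = [("pen", 5), ("ink", 3)]  # 'pen' already exists, 'ink' is new

# Insert each row; where the key already exists, update the row instead.
conn.executemany("""
    INSERT INTO stock (item, qty) VALUES (?, ?)
    ON CONFLICT(item) DO UPDATE SET qty = qty + excluded.qty
""", incoming)

stock = conn.execute("SELECT item, qty FROM stock ORDER BY item").fetchall()
print(stock)
```

Inside DO UPDATE, the bare column name refers to the existing row and excluded refers to the row that could not be inserted.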

Transaction controls

Transactions, if available, wrap DML (data manipulation) operations:

START TRANSACTION (or BEGIN WORK, or BEGIN TRANSACTION, depending on SQL dialect) marks the start of a database transaction, which either completes entirely or not at all.

SAVE TRANSACTION (or SAVEPOINT) saves the state of the database at the current point in the transaction, for example:

    CREATE TABLE tbl_1(id INT);
    INSERT INTO tbl_1(id) VALUES(1);
    INSERT INTO tbl_1(id) VALUES(2);
    COMMIT;
    UPDATE tbl_1 SET id=200 WHERE id=1;
    SAVEPOINT id_1upd;
    UPDATE tbl_1 SET id=1000 WHERE id=2;
    ROLLBACK TO id_1upd;
    SELECT id FROM tbl_1;

COMMIT causes all data changes in a transaction to be made permanent.

ROLLBACK causes all data changes since the last COMMIT or ROLLBACK to be discarded, leaving the state of the data as it was prior to those changes.
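The savepoint example can be replayed step by step. This is a sketch under the assumption of SQLite driven from Python's sqlite3 module, with isolation_level=None so that the explicit BEGIN, COMMIT, SAVEPOINT and ROLLBACK statements control the transactions; other database systems may spell these statements differently.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tbl_1(id INT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO tbl_1(id) VALUES(1)")
conn.execute("INSERT INTO tbl_1(id) VALUES(2)")
conn.execute("COMMIT")                       # rows 1 and 2 are now permanent

conn.execute("BEGIN")
conn.execute("UPDATE tbl_1 SET id=200 WHERE id=1")
conn.execute("SAVEPOINT id_1upd")            # remember the state after the first update
conn.execute("UPDATE tbl_1 SET id=1000 WHERE id=2")
conn.execute("ROLLBACK TO id_1upd")          # undo only the second update
after_savepoint = [r[0] for r in conn.execute("SELECT id FROM tbl_1 ORDER BY id")]

conn.execute("ROLLBACK")                     # discard everything since the last COMMIT
after_rollback = [r[0] for r in conn.execute("SELECT id FROM tbl_1 ORDER BY id")]
print(after_savepoint, after_rollback)
```

Rolling back to the savepoint keeps the first update (id 200) and drops the second, while the final ROLLBACK returns the table to its committed state.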

Once the COMMIT statement completes, the transaction's changes cannot be rolled back. COMMIT and ROLLBACK terminate the current transaction and release data locks. In the absence of a START TRANSACTION or similar statement, the semantics of SQL are implementation-dependent. Example: a classic bank transfer of funds transaction. (The error test in the last two lines is pseudocode; the actual syntax depends on the SQL dialect.)

    START TRANSACTION;
    UPDATE Account SET amount=amount-200 WHERE account_number=1234;
    UPDATE Account SET amount=amount+200 WHERE account_number=2345;
    IF ERRORS=0 COMMIT;
    IF ERRORS<>0 ROLLBACK;

Data definition

The Data Definition Language (DDL) manages table and index structure. The most basic items of DDL are the CREATE, ALTER, RENAME, DROP and TRUNCATE statements:

CREATE creates an object (a table, for example) in the database, for example:

    CREATE TABLE My_table(
        my_field1 INT,
        my_field2 VARCHAR(50),
        my_field3 DATE NOT NULL,
        PRIMARY KEY (my_field1, my_field2)
    );

ALTER modifies the structure of an existing object in various ways, for example, adding a column or a constraint to an existing table:

    ALTER TABLE My_table ADD my_field4 NUMBER(3) NOT NULL;

TRUNCATE deletes all data from a table very quickly, removing the data inside the table but not the table itself. It usually implies a subsequent COMMIT operation, i.e., it cannot be rolled back (unlike DELETE, the data is not written to the logs for later rollback):

    TRUNCATE TABLE My_table;

DROP deletes an object in the database, usually irretrievably, i.e., it cannot be rolled back, for example:

    DROP TABLE My_table;

Data types

Each column in an SQL table declares the type(s) that column may contain. ANSI SQL includes the following data types.[18]

Character strings

CHARACTER(n) or CHAR(n): fixed-width n-character string, padded with spaces as needed

CHARACTER VARYING(n) or VARCHAR(n): variable-width string with a maximum size of n characters

NATIONAL CHARACTER(n) or NCHAR(n): fixed-width string supporting an international character set

NATIONAL CHARACTER VARYING(n) or NVARCHAR(n): variable-width NCHAR string

Bit strings

BIT(n): an array of n bits

BIT VARYING(n): an array of up to n bits

Numbers

INTEGER and SMALLINT

FLOAT, REAL and DOUBLE PRECISION

NUMERIC(precision, scale) or DECIMAL(precision, scale)

For example, the number 123.45 has a precision of 5 and a scale of 2. The precision is a positive integer that determines the number of significant digits in a particular radix (binary or decimal). The scale is a non-negative integer. A scale of 0 indicates that the number is an integer. For a decimal number with scale S, the exact numeric value is the integer value of the significant digits divided by 10^S (here, 12345 / 10^2 = 123.45).

SQL provides a function to round numerics or dates, called TRUNC (in Informix, DB2, PostgreSQL, Oracle and MySQL) or ROUND (in Informix, Sybase, Oracle, PostgreSQL and Microsoft SQL Server).[19]

Date and time

DATE: for date values (e.g., 2011-05-03)

TIME: for time values (e.g., 15:51:36). The granularity of the time value is usually a tick (100 nanoseconds).

TIME WITH TIME ZONE or TIMETZ: the same as TIME, but including details about the time zone in question

TIMESTAMP: a DATE and a TIME put together in one variable (e.g., 2011-05-03 15:51:36)

TIMESTAMP WITH TIME ZONE or TIMESTAMPTZ: the same as TIMESTAMP, but including details about the time zone in question

SQL provides several functions for generating a date / time variable out of a date / time string (TO_DATE, TO_TIME, TO_TIMESTAMP), as well as for extracting the respective members (seconds, for instance) of such variables. The current system date / time of the database server can be called by using functions like NOW.

Data control

The Data Control Language (DCL) authorizes users and groups of users to access and manipulate data. Its two main statements are:

GRANT authorizes one or more users to perform an operation or a set of operations on an object.

REVOKE eliminates a grant, which may be the default grant.

Example:

    GRANT SELECT, UPDATE ON My_table TO some_user, another_user;
    REVOKE SELECT, UPDATE ON My_table FROM some_user, another_user;

