DATAllegro WP
By Tom Coffing
Tom Coffing is one of the leading experts on both Teradata and DATAllegro. Tom has written over 20 books on data warehousing, Teradata, and DATAllegro. He has taught over 1,000 Teradata classes and is considered one of the best technical speakers and writers in the industry. Tom founded Coffing Data Warehousing 15 years ago and is its Chief Executive Officer (CEO) and President. Coffing Data Warehousing performs training, consulting, and professional services, and has written almost 90 percent of the books on Teradata. Many people know Tom Coffing as Tera-Tom.

Tom Coffing's team of developers produces the Nexus, which is considered the Rosetta Stone of data warehouse software. Not only is it considered the best query tool, but it has been designed to work with Teradata, IBM, Oracle, SQL Server, DATAllegro, Netezza, and Greenplum systems. The Nexus is also a DBA dream tool because it has point-and-click capabilities for compression, replicating data and DDL between different systems, and comparing systems at the database, table, and data level and synchronizing the results. The Nexus is also used to build load scripts and to schedule queries and batch jobs. It has been tuned to make Teradata and DATAllegro co-existence a breeze. Tom became a DATAllegro partner over a year ago. Please feel free to contact Tom at Tom.Coffing@CoffingDW.com. To download a free trial of the Nexus, visit the Coffing Data Warehousing website at www.CoffingDW.com.
When Teradata landed the data on the disks and rowed it ashore, it discovered BI. Back then we called it decision support systems (DSS).
Sequential Processing
Teradata Strengths
He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool forever.
Unknown
Teradata allows customers to ask any question of any data at any time. The strengths of its system include:

Parallel Processing - Teradata designed its system to load data in parallel, query data in parallel, and back up data in parallel. This provides a great deal of power and performance.

Linear Scalability - Most Teradata customers made relatively small first purchases, but these customers were able to grow their systems linearly and indefinitely: more data, more users, and more applications.

Load Utilities - Teradata produced FastLoad, MultiLoad, and FastExport to move data on and off the Teradata system in blocks, providing extremely fast load speeds.

Experienced Optimizer - Teradata continually enhanced its Optimizer to come up with the fastest plan for accessing the data.

Mixed Workload of Queries - Teradata has done a great job of building a system that can handle long, intensive queries as well as short, sub-second queries.

Active Data Warehousing - Teradata has done a nice job of building Active Data Warehouses. These are warehouses that can support BI and take a company's transactions and immediately place them in the warehouse for analysis.

Workload Management - Teradata has done a solid job of managing query workloads through software and business rules. Lower-priority queries are delayed, placed in a queue, or saved for batch processing so as not to overburden the system.
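Teradata's actual workload manager is proprietary, but the delay-and-queue behavior described above can be sketched with a simple priority queue. The priority values and query labels here are illustrative assumptions; smaller means more urgent:

```python
import heapq

# Queries enter with a business-rule priority; the most urgent
# work is always dispatched first, and lower priorities wait.
queue = []
heapq.heappush(queue, (3, "overnight batch report"))
heapq.heappush(queue, (1, "tactical sub-second lookup"))
heapq.heappush(queue, (2, "ad hoc analyst query"))

dispatch_order = []
while queue:
    priority, query = heapq.heappop(queue)
    dispatch_order.append(query)

print(dispatch_order)
# ['tactical sub-second lookup', 'ad hoc analyst query', 'overnight batch report']
```

A real workload manager adds time limits, user classes, and batch deferral on top, but the core idea is exactly this ordering.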
Those who cannot remember the past are condemned to repeat it.
George Santayana, poet and philosopher
A man was driving on the highway when he got a call from his wife. She said, "Be careful, Bill. The news just reported some idiot driving the wrong way on the highway." He said, "It is worse than that. It isn't just one car; there are hundreds of them!"

The data warehouse direction of five years ago is much different than today's, but some customers continue to head the wrong way down the data highway, and an expensive traffic jam is occurring. Teradata's creators entered data warehousing because they thought they could build systems that were 10 times faster and cheaper than the mainframe. Now, Teradata systems are known for their enormous expense. The hardware and software are extremely expensive, but the prices charged for education, professional services, and consulting have made customers re-think their enterprise data warehouse (EDW) strategy. It is funny how history repeats itself. Those who cannot remember the past are condemned to repeat payments!

Teradata systems made sense 10 years ago. The hardware was expensive to produce, but without serious competition, it was the freeway of choice. Plus, parallel processing was considered a near-impossible feat. A scalable, centralized data warehouse provided a company one version of the truth, and the price for the competitive edge was worth it.

Here is Teradata's dilemma: all data warehouse vendors today use parallel processing! Each vendor now uses commodity hardware, and each can linearly scale its systems to petabytes of data. Each vendor handles a mixed workload, and each has developed effective and speedy load strategies. It is like parity in the National Football League: last year's champs might not make the playoffs the following year.

Today's data warehouse is more complicated than ever before. Geographic locations in different time zones need to access the warehouse at their peak times. Ad hoc queries make tuning difficult, and enormous pressure is placed on the warehouse to produce thousands of reports.
Couple this with the fact that some departments need to access data that is three months old while others need data that is three years old. Also understand that a mixed variety of logical models is now used within a data warehouse. All of these queries are competing for CPU, memory, and disk. It just doesn't make logistical sense to try to force it all onto one centralized platform.
Paying top prices for every query won't kill you. But why take the chance?
Tera-Tom Coffing
All people are created equal, but all data is not! I trained a major Fortune 35 insurance company on Teradata for years. They have been one of Teradata's biggest supporters, so I was surprised when they told me they had decided to do a proof-of-concept on the appliance vendors. I have great respect for their data warehouse director. He stated, "To me, Teradata is the corporate jet of data warehousing, but you don't have to take the corporate jet everywhere you go."

Because Teradata was born to be parallel, every request for data has to take the corporate jet. Teradata is the most expensive data warehouse solution in the world. Its architecture requires the entire system to work as a single, expensive entity. Every query, data load, batch window process, or touch of a Teradata warehouse has an effect on the entire system.

Think about it. Imagine if everywhere you traveled you took a corporate jet. It would be fantastic for international travel and long trips, but what about short distances? Do you really want to take the jet for a 10-minute drive downtown, or even across the state? Appliances have the speed of a corporate jet but the price of a taxi ride.

A distributed system that separates certain data allows that data to be treated in the best and most effective manner. Creating a data warehouse environment that allows for multiple travel modes and budgetary options is the data warehouse of the future. Today's data needs to travel by corporate jet, car, bus, truck, train, bike, motorcycle, and tricycle. Newer vendors are able to produce corporate-jet speed but have designs that allow for more options and cheaper solutions.
"Time is the coin of your life. It is the only coin you have, and only you can determine how it will be spent. Be careful lest you let other people spend it for you."
Carl Sandburg, poet
The quote above is beautiful and should strike a chord with every human being. Let me convert it into language a data warehouse can understand.
"Night time is the coin of your life. It is the best time to load data. Screw this up and the business will no longer make cents!"
Mia Batchwindows To-Longfellow
Large companies with a centralized EDW are beginning to have problems meeting their batch window. Again, the problem is that Teradata systems are an all-or-nothing proposition. The great news is that Teradata has had 20 years to build its load utilities. The company's three fastest utilities move data at the block level with great speed: FastLoad to insert data into empty tables, MultiLoad to perform inserts, updates, and deletes on populated tables, and FastExport to move data off of Teradata.

Here is the problem. Teradata only allows up to 15 loads to run at one time, and since these loads are so resource-intensive, many companies set their maximum at five! With this limitation, it is only a matter of time (and growth) until the batch window can't be met. Did someone really decide to dedicate their batch window to one enterprise data warehouse? It comes down to simple math. As a cowboy in the Wild West once said: "Never insult seven men when all you're packing is a six-gun!" Do the math.
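The "simple math" can be sketched in a few lines. This is a minimal wave model of block-level loading: the 5-load cap comes from the text above, but the table counts and load times are hypothetical:

```python
# Illustrative model of the batch-window squeeze: with a hard cap on
# concurrent block-level load utilities, tables load in "waves".
# All numbers below are hypothetical.

def batch_window_hours(num_tables, avg_load_minutes, max_concurrent_loads):
    """Hours needed to load num_tables when at most
    max_concurrent_loads utilities may run at once."""
    waves = -(-num_tables // max_concurrent_loads)  # ceiling division
    return waves * avg_load_minutes / 60

# A site capped at 5 concurrent loads, 120 tables, 20 minutes each:
print(batch_window_hours(120, 20, 5))  # 8.0 hours
# The same tables split across two platforms, each running 5 loads:
print(batch_window_hours(60, 20, 5))   # 4.0 hours per platform
```

Growth only raises the table count, which is how a fixed concurrency cap eventually consumes the entire night.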
2008 Coffing Data Warehouse. Page 11
Once the game is over, the king and the pawn go back in the same box.
Italian Proverb
Once the Teradata data warehouse game is on, the users query the detail data, data marts, and older data on the same box. The data pawns get the same treatment as the king and queen: it costs exactly the same to query older data as newer data. Even data that doesn't have to be processed urgently is just as expensive as the most urgent of queries. And as they say in Australia, implementing new applications and data in the warehouse requires a big check, mate.

Teradata is conceptually much like a mainframe because it is expensive to purchase, upgrade, and maintain. It also provides slow response to new data marts, and system performance is greatly affected as new data and users are added. It no longer makes financial or logistical sense to make every user query one single centralized system. User needs are too diverse, and companies are too large and complicated, for one central and expensive solution.

How are you going to satisfy such a wide variety of users? It is a lot of pressure to place on a system to handle all standard reports, batch loading mixed with near-real-time loads, tactical queries next to long detail-data queries, CRM, ad hoc queries, and data marts driving BI analytics. Add to the pressure that most companies are global and that each region's peak times for queries or loading conflict with the others.
If I have seen farther than others, it is because I was standing on the shoulders of giants.
Isaac Newton
Stuart Frost, CEO and founder of DATAllegro, took his years of experience and built DATAllegro. Its architecture provides power through parallel processing, but also a hub-and-spoke design that is flexible and cost-effective. The load capabilities are outstanding, the ability to handle a mixed workload is excellent, and the cost is extremely low.
Two roads diverged in a wood and I took the one less traveled by, and that has made all the difference.
Robert Frost
A Teradata system processes data in parallel, but it always spreads the data across all disks and acts as one big entity. Teradata designed its systems back in the 1980s and is constrained by that original design. Moving from a centralized solution to a distributed data warehouse will be the best road you can take. Giving a user the ability to query a road less traveled will make a huge difference in speed and performance.
I saw the angel in the marble and carved until I set him free.
Michelangelo
DATAllegro saw the users in constraints and designed a warehouse that set them free. A DATAllegro system processes data in parallel, but the genius behind the design is the hub-and-spoke architecture. Dell servers can be grouped together to form a grid. Different grids are tied together, but act as separate systems. This has improved data loads, reduced contention, and made it possible to quickly implement new data marts.

A distributed environment allows you to separate newer data from older data and place data where it makes sense for different time zones. You can free up resources on your Teradata system by allowing departments with different service level agreements (SLAs) to process on different logical platforms. The ability to provide different physical locations for sensitive, security-risk data is a huge benefit. Handling mixed workloads such as tactical queries, BI reporting tools, ad hoc queries, geographical time zones, batch windows, continuous data loads, and complex queries is not practical on a single platform.
An invasion of armies can be resisted, but not an idea whose time has come.
Victor Hugo
Not paying an arm and a leg for a data warehouse is an idea whose time has come. Utilizing Dell servers, Cisco routers, EMC disks, and a proven open-source database in Ingres, DATAllegro has brought world-class commodity hardware and software together, and the invasion has begun.

DATAllegro utilizes Ingres, one of the first relational database products, dating back to the 1970s. Built by database legend Michael Stonebraker, Ingres has been enhanced by DATAllegro with an added software layer that creates an MPP architecture, parallelizes queries across processors, and manages workloads.

DATAllegro is extremely fast because of its parallel processing, but it also has the ability to process data at different costs. Most companies keep data for years, but the latest three months of data is usually accessed much more heavily than the older data. It makes sense to process the most recent three months of data on your fastest system, while data that is rarely accessed should be processed at a much lower cost. Multi-temperature data warehousing allows you to process your most active data at higher speeds and costs, and less active data at slower speeds and lower costs.
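The multi-temperature idea can be sketched as a simple routing rule. The tier names and the 90-day and one-year boundaries below are illustrative assumptions, not DATAllegro's actual configuration:

```python
from datetime import date, timedelta

# Hypothetical temperature tiers: the most recent data on the fastest,
# most expensive storage; older data on slower, cheaper storage.
TIERS = [
    (timedelta(days=90), "hot"),    # fastest disks, highest cost
    (timedelta(days=365), "warm"),  # mid-range storage
    (timedelta.max, "cold"),        # slow, low-cost storage
]

def tier_for(row_date, today=None):
    """Pick the storage tier for a row based on its age."""
    age = (today or date.today()) - row_date
    for boundary, tier in TIERS:
        if age <= boundary:
            return tier
    return TIERS[-1][1]

print(tier_for(date(2008, 2, 1), today=date(2008, 3, 1)))  # hot
print(tier_for(date(2005, 1, 1), today=date(2008, 3, 1)))  # cold
```

The point of the sketch is that the routing decision is cheap and mechanical; the savings come from what each tier's hardware costs.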
A man who views the world at 50 the same as he did at 20 has wasted 30 years of his life.
Muhammad Ali
Teradata, a data warehouse pioneer, helped pave the way for how systems are built today. But Teradata was designed 30 years ago, and business intelligence has evolved. Did you know that 80 percent of most data warehouse queries can be satisfied by a data mart? The advantage of an appliance is that it is easy to install and has a rapid time-to-productivity.

DATAllegro systems have been designed to handle detail data in their hub and other data in a spoke. This is the best of both worlds. DATAllegro systems are true hub-and-spoke distributed systems, and this design is the data warehouse implementation of the future. Instead of having all data on a centralized system, why not divide data marts into different spokes? You can have data for one region, time zone, or even continent in the spokes that make sense. You might take all product data and place it separately from data that has no relation to it. This can all be done without duplicating data, and it gives companies far more options than "throw it on the EDW."
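The spoke-first idea can be sketched as a routing table. The region names, mart names, and the needs_detail flag are all illustrative assumptions for this sketch:

```python
# Hypothetical hub-and-spoke routing: regional data marts (spokes)
# answer most queries; anything needing detail data, or a region
# without its own mart, falls back to the hub.
SPOKES = {"EMEA": "emea_mart", "APAC": "apac_mart", "AMER": "amer_mart"}
HUB = "detail_hub"

def route(region, needs_detail=False):
    """Return the platform that should serve this query."""
    if needs_detail:
        return HUB
    return SPOKES.get(region, HUB)

print(route("EMEA"))                     # emea_mart
print(route("EMEA", needs_detail=True))  # detail_hub
```

If roughly 80 percent of queries resolve to a spoke, only the remaining fifth ever touches the expensive detail-data hub.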
Use a query tool such as the Nexus to run the SQL on each vendor you evaluate. This tool was designed to work with every vendor and allows for apples-to-apples timing comparisons. Sign the appropriate paperwork to secure your data with each participating vendor, and make sure the data is cleansed before, and destroyed after, the POC is run.
Use as much real data from your operational systems as makes sense.

Don't let the vendors see the queries before the POC, and make sure the queries are first run without any tuning. After the queries are run, let the vendors do some tuning. Tuning can be important, and comparing how queries run in both a tuned and an untuned environment will help you understand the strengths and weaknesses of the product and where it fits best within your organization.

Test almost every type of query, including multi-table joins; left, right, and full outer joins; subqueries; correlated subqueries; OLAP functions; tactical short queries; and long, complex queries, to ensure there are no surprises. Categorize the testing by running easy, medium, and difficult queries, and make sure to include a set of complex queries with a small result set.

Test the queries with a single user, with multiple users, and with the number of users you would expect in your production environment. Tests approaching real-world conditions are best, and always test for linear scalability.

Make sure data loading is tested as part of the POC. This includes extraction from your current data warehouse and direct loads from operational systems. If continuous loads are expected in your production environment, test for that, and test how queries perform while loads are running.

Don't be afraid to run a few very complex queries that may or may not work with your current vendor. Demonstrate integration with your current warehouse, such as Teradata, Oracle, or IBM. Then test integration with your tools, such as Informatica, Ab Initio, Business Objects, Cognos, SAS, or MicroStrategy. Don't forget to demonstrate integration with your backup and restore strategies.

Benchmark at least two to five terabytes, depending on your expected requirements. Make sure you test system setup and administration.
This will play a role tactically in understanding where the new system fits in your organizational plans, and it helps ensure there are no hidden costs. Make sure no vendor has an unfair advantage: make apples-to-apples comparisons on vendor platforms and ensure a set of rules is followed. If a POC is to be performed offsite, consider allowing no access to the box unless your own people are present to monitor the POC. Test the ability to convert table structures, i.e. data definition language (DDL), from your current data warehouse provider to the vendor's DDL. Test current applications to see what SQL conversions might be needed for the new vendor.
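A small timing harness covers the single-user versus multi-user comparison above. This is a sketch only: run_query is a stand-in for whatever ODBC/JDBC driver call you actually use, and the sleep simulates query latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    """Placeholder for a real driver call (ODBC/JDBC/etc.)."""
    time.sleep(0.05)  # simulated query latency
    return sql

def timed_run(queries, concurrency=1):
    """Wall-clock seconds to run a query set at a given concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(run_query, queries))
    return time.perf_counter() - start

queries = ["SELECT ..."] * 8  # the same set for every vendor
single_user = timed_run(queries, concurrency=1)
multi_user = timed_run(queries, concurrency=4)
```

Running the identical query set, in the same order and at the same concurrency, against each vendor is what keeps the comparison apples-to-apples.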
Hire or assign a project manager to run the POC in its entirety. You should be able to perform the actual POC within three to five days. Get pricing from each vendor, and remember that everything is negotiable.
Summary
Thank you for reading this complimentary paper provided by DATAllegro. For more information on this topic or DATAllegro products, please contact us at (877) 499-3282, or visit www.datallegro.com.
VERSION 3
ABOUT DATALLEGRO
DATAllegro v3 is the industry's most advanced data warehouse appliance on an enterprise-class platform. By combining DATAllegro's patent-pending software with the industry's leading hardware, storage, and database technologies, DATAllegro has taken data warehouse performance and innovation to the next level. DATAllegro v3 goes beyond the low cost and high performance of first-generation data warehouse appliances and adds the flexibility and scalability that only an open platform can offer. The result is a complete data warehouse appliance that enables companies with large volumes of data to increase their business intelligence. Whether you have a few terabytes of user data or hundreds, DATAllegro's data warehouse appliances deliver a fast, flexible, and affordable solution that allows a company's data to grow at the pace of its business.
85 Enterprise, 2nd Floor, Aliso Viejo, CA 92656 | Sales: (877) 499-3282 | Phone: (949) 680-3000 | www.datallegro.com