51,50 €*
Versandkostenfrei per Post / DHL
Aktuell nicht verfügbar
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions--before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
* Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
* Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
* Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
* Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide--from simple single table data migration to complex multisystem clustered data integration tasks.
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions--before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
* Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
* Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
* Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
* Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide--from simple single table data migration to complex multisystem clustered data integration tasks.
Matt Casters is Founder of Kettle and works as Chief Data Integration at Pentaho, where he leads Kettle software development. Roland Bouman is an application developer focusing on open source web technology, databases, and business intelligence. Jos van Dongen is an independent business intelligence consultant and well-known author, analyst, and presenter.
Introduction xxxi
Part I Getting Started 1
Chapter 1 ETL Primer 3
OLTP versus Data Warehousing 3
What Is ETL? 5
The Evolution of ETL Solutions 5
ETL Building Blocks 7
ETL, ELT, and EII 8
ELT 9
EII: Virtual Data Integration 10
Data Integration Challenges 11
Methodology: Agile BI 12
ETL Design 14
Data Acquisition 14
Beware of Spreadsheets 15
Design for Failure 15
Change Data Capture 16
Data Quality 16
Data Profiling 16
Data Validation 17
ETL Tool Requirements 17
Connectivity 17
Platform Independence 18
Scalability 18
Design Flexibility 19
Reuse 19
Extensibility 19
Data Transformations 20
Testing and Debugging 21
Lineage and Impact Analysis 21
Logging and Auditing 22
Summary 22
Chapter 2 Kettle Concepts 23
Design Principles 23
The Building Blocks of Kettle Design 25
Transformations 25
Steps 26
Transformation Hops 26
Parallelism 27
Rows of Data 27
Data Conversion 29
Jobs 30
Job Entries 31
Job Hops 31
Multiple Paths and Backtracking 32
Parallel Execution 33
Job Entry Results 34
Transformation or Job Metadata 36
Database Connections 37
Special Options 38
The Power of the Relational Database 39
Connections and Transactions 39
Database Clustering 40
Tools and Utilities 41
Repositories 41
Virtual File Systems 42
Parameters and Variables 43
Defining Variables 43
Named Parameters 44
Using Variables 44
Visual Programming 45
Getting Started 46
Creating New Steps 47
Putting It All Together 49
Summary 51
Chapter 3 Installation and Configuration 53
Kettle Software Overview 53
Integrated Development Environment: Spoon 55
Command-Line Launchers: Kitchen and Pan 57
Job Server: Carte 57
[...] and [...] 58
Installation 58
Java Environment 58
Installing Java Manually 58
Using Your Linux Package Management System 59
Installing Kettle 59
Versions and Releases 59
Archive Names and Formats 60
Downloading and Uncompressing 60
Running Kettle Programs 61
Creating a Shortcut Icon or Launcher for Spoon 62
Configuration 63
Configuration Files and the .kettle Directory 63
The Kettle Shell Scripts 69
General Structure of the Startup Scripts 70
Adding an Entry to the Classpath 70
Changing the Maximum Heap Size 71
Managing JDBC Drivers 72
Summary 72
Chapter 4 An Example ETL Solution-Sakila 73
Sakila 73
The Sakila Sample Database 74
DVD Rental Business Process 74
Sakila Database Schema Diagram 75
Sakila Database Subject Areas 75
General Design Considerations 77
Installing the Sakila Sample Database 77
The Rental Star Schema 78
Rental Star Schema Diagram 78
Rental Fact Table 79
Dimension Tables 79
Keys and Change Data Capture 80
Installing the Rental Star Schema 81
Prerequisites and Some Basic Spoon Skills 81
Setting Up the ETL Solution 82
Creating Database Accounts 82
Working with Spoon 82
Opening Transformation and Job Files 82
Opening the Step's Configuration Dialog 83
Examining Streams 83
Running Jobs and Transformations 83
The Sample ETL Solution 84
Static, Generated Dimensions 84
Loading the dim_date Dimension Table 84
Loading the dim_time Dimension Table 86
Recurring Load 87
The load_rentals Job 88
The load_dim_staff Transformation 91
Database Connections 91
The load_dim_customer Transformation 95
The load_dim_store Transformation 98
The fetch_address Subtransformation 99
The load_dim_actor Transformation 101
The load_dim_film Transformation 102
The load_fact_rental Transformation 107
Summary 109
Part II ETL 111
Chapter 5 ETL Subsystems 113
Introduction to the 34 Subsystems 114
Extraction 114
Subsystems 1-3: Data Profiling, Change Data Capture, and
Extraction 115
Cleaning and Conforming Data 116
Subsystem 4: Data Cleaning and Quality Screen
Handler System 116
Subsystem 5: Error Event Handler 117
Subsystem 6: Audit Dimension Assembler 117
Subsystem 7: Deduplication System 117
Subsystem 8: Data Conformer 118
Data Delivery 118
Subsystem 9: Slowly Changing Dimension Processor 118
Subsystem 10: Surrogate Key Creation System 119
Subsystem 11: Hierarchy Dimension Builder 119
Subsystem 12: Special Dimension Builder 120
Subsystem 13: Fact Table Loader 121
Subsystem 14: Surrogate Key Pipeline 121
Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121
Subsystem 16: Late-Arriving Data Handler 122
Subsystem 17: Dimension Manager System 122
Subsystem 18: Fact Table Provider System 122
Subsystem 19: Aggregate Builder 123
Subsystem 20: Multidimensional (OLAP) Cube Builder 123
Subsystem 21: Data Integration Manager 123
Managing the ETL Environment 123
Summary 126
Chapter 6 Data Extraction 127
Kettle Data Extraction Overview 128
File-Based Extraction 128
Working with Text Files 128
Working with XML files 133
Special File Types 134
Database-Based Extraction 134
Web-Based Extraction 137
Text-Based Web Extraction 137
HTTP Client 137
Using SOAP 138
Stream-Based and Real-Time Extraction 138
Working with ERP and CRM Systems 138
ERP Challenges 139
Kettle ERP Plugins 140
Working with SAP Data 140
ERP and CDC Issues 146
Data Profiling 146
Using [...] DataCleaner 147
Adding Profile Tasks 149
Adding Database Connections 149
Doing an Initial Profile 151
Working with Regular Expressions 151
Profiling and Exploring Results 152
Validating and Comparing Data 153
Using a Dictionary for Column Dependency Checks 153
Alternative Solutions 154
Text Profiling with Kettle 154
CDC: Change Data Capture 154
Source Data-Based CDC 155
Trigger-Based CDC 157
Snapshot-Based CDC 158
Log-Based CDC 162
Which CDC Alternative Should You Choose? 163
Delivering Data 164
Summary 164
Chapter 7 Cleansing and Conforming 167
Data Cleansing 168
Data-Cleansing Steps 169
Using Reference Tables 172
Conforming Data Using Lookup Tables 172
Conforming Data Using Reference Tables 175
Data Validation 179
Applying Validation Rules 180
Validating Dependency Constraints 183
Error Handling 183
Handling Process Errors 184
Transformation Errors 186
Handling Data (Validation) Errors 187
Auditing Data and Process Quality 191
Deduplicating Data 192
Handling Exact Duplicates 193
The Problem of Non-Exact Duplicates 194
Building Deduplication Transforms 195
Step 1: Fuzzy Match 197
Step 2: Select Suspects 198
Step 3: Lookup Validation Value 198
Step 4: Filter Duplicates 199
Scripting 200
Formula 201
JavaScript 202
User-Defined Java Expressions 202
Regular Expressions 203
Summary 205
Chapter 8 Handling Dimension Tables 207
Managing Keys 208
Managing Business Keys 209
Keys in the Source System 209
Keys in the Data Warehouse 209
Business Keys 209
Storing Business Keys 210
Looking Up Keys with Kettle 210
Generating Surrogate Keys 210
The "Add sequence" Step 211
Working with auto_increment or IDENTITY Columns 217
Keys for Slowly Changing Dimensions 217
Loading Dimension Tables 218
Snowflaked Dimension Tables 218
Top-Down Level-Wise Loading 219
Sakila Snowflake Example 219
Sample Transformation 221
Database Lookup Configuration 222
Sample Job 225
Star Schema Dimension Tables 226
Denormalization 226
Denormalizing to 1NF with the "Database lookup" Step 226
Change Data Capture 227
Slowly Changing Dimensions 228
Types of Slowly Changing Dimensions 228
Type 1 Slowly Changing Dimensions 229
The Insert / Update Step 229
Type 2 Slowly Changing Dimensions 232
The "Dimension lookup / update" Step 232
Other Types of Slowly Changing Dimensions 237
Type 3 Slowly Changing Dimensions 237
Hybrid Slowly Changing Dimensions 238
More Dimensions 239
Generated Dimensions 239
Date and Time Dimensions 239
Generated Mini-Dimensions 239
Junk Dimensions 241
Recursive Hierarchies 242
Summary 243
Chapter 9 Loading Fact Tables 245
Loading in Bulk 246
STDIN and FIFO 247
Kettle Bulk Loaders 248
MySQL Bulk Loading 249
LucidDB Bulk Loader 249
Oracle Bulk Loader 249
PostgreSQL Bulk Loader 250
Table Output Step 250
General Bulk Load Considerations 250
Dimension Lookups 251
Maintaining Referential Integrity 251
The Surrogate Key Pipeline 252
Using In-Memory Lookups 253
Stream Lookups 253
Late-Arriving Data 255
Late-Arriving Facts 256
Late-Arriving Dimensions 256
Fact Table Handling 260
Periodic and Accumulating Snapshots 260
Introducing State-Oriented Fact Tables 261
Loading Periodic Snapshots 263
Loading Accumulating Snapshots...
Erscheinungsjahr: | 2010 |
---|---|
Genre: | Informatik |
Rubrik: | Naturwissenschaften & Technik |
Medium: | Taschenbuch |
Inhalt: | 720 S. |
ISBN-13: | 9780470635179 |
ISBN-10: | 0470635177 |
Sprache: | Englisch |
Herstellernummer: | 14563517000 |
Einband: | Kartoniert / Broschiert |
Autor: |
Casters, Matt
Bouman, Roland Dongen, Jos Van |
Hersteller: |
Wiley
John Wiley & Sons |
Maße: | 235 x 191 x 39 mm |
Von/Mit: | Matt Casters (u. a.) |
Erscheinungsdatum: | 28.09.2010 |
Gewicht: | 1,323 kg |
Matt Casters is Founder of Kettle and works as Chief Data Integration at Pentaho, where he leads Kettle software development. Roland Bouman is an application developer focusing on open source web technology, databases, and business intelligence. Jos van Dongen is an independent business intelligence consultant and well-known author, analyst, and presenter.
Introduction xxxi
Part I Getting Started 1
Chapter 1 ETL Primer 3
OLTP versus Data Warehousing 3
What Is ETL? 5
The Evolution of ETL Solutions 5
ETL Building Blocks 7
ETL, ELT, and EII 8
ELT 9
EII: Virtual Data Integration 10
Data Integration Challenges 11
Methodology: Agile BI 12
ETL Design 14
Data Acquisition 14
Beware of Spreadsheets 15
Design for Failure 15
Change Data Capture 16
Data Quality 16
Data Profiling 16
Data Validation 17
ETL Tool Requirements 17
Connectivity 17
Platform Independence 18
Scalability 18
Design Flexibility 19
Reuse 19
Extensibility 19
Data Transformations 20
Testing and Debugging 21
Lineage and Impact Analysis 21
Logging and Auditing 22
Summary 22
Chapter 2 Kettle Concepts 23
Design Principles 23
The Building Blocks of Kettle Design 25
Transformations 25
Steps 26
Transformation Hops 26
Parallelism 27
Rows of Data 27
Data Conversion 29
Jobs 30
Job Entries 31
Job Hops 31
Multiple Paths and Backtracking 32
Parallel Execution 33
Job Entry Results 34
Transformation or Job Metadata 36
Database Connections 37
Special Options 38
The Power of the Relational Database 39
Connections and Transactions 39
Database Clustering 40
Tools and Utilities 41
Repositories 41
Virtual File Systems 42
Parameters and Variables 43
Defining Variables 43
Named Parameters 44
Using Variables 44
Visual Programming 45
Getting Started 46
Creating New Steps 47
Putting It All Together 49
Summary 51
Chapter 3 Installation and Configuration 53
Kettle Software Overview 53
Integrated Development Environment: Spoon 55
Command-Line Launchers: Kitchen and Pan 57
Job Server: Carte 57
[...] and [...] 58
Installation 58
Java Environment 58
Installing Java Manually 58
Using Your Linux Package Management System 59
Installing Kettle 59
Versions and Releases 59
Archive Names and Formats 60
Downloading and Uncompressing 60
Running Kettle Programs 61
Creating a Shortcut Icon or Launcher for Spoon 62
Configuration 63
Configuration Files and the .kettle Directory 63
The Kettle Shell Scripts 69
General Structure of the Startup Scripts 70
Adding an Entry to the Classpath 70
Changing the Maximum Heap Size 71
Managing JDBC Drivers 72
Summary 72
Chapter 4 An Example ETL Solution-Sakila 73
Sakila 73
The Sakila Sample Database 74
DVD Rental Business Process 74
Sakila Database Schema Diagram 75
Sakila Database Subject Areas 75
General Design Considerations 77
Installing the Sakila Sample Database 77
The Rental Star Schema 78
Rental Star Schema Diagram 78
Rental Fact Table 79
Dimension Tables 79
Keys and Change Data Capture 80
Installing the Rental Star Schema 81
Prerequisites and Some Basic Spoon Skills 81
Setting Up the ETL Solution 82
Creating Database Accounts 82
Working with Spoon 82
Opening Transformation and Job Files 82
Opening the Step's Configuration Dialog 83
Examining Streams 83
Running Jobs and Transformations 83
The Sample ETL Solution 84
Static, Generated Dimensions 84
Loading the dim_date Dimension Table 84
Loading the dim_time Dimension Table 86
Recurring Load 87
The load_rentals Job 88
The load_dim_staff Transformation 91
Database Connections 91
The load_dim_customer Transformation 95
The load_dim_store Transformation 98
The fetch_address Subtransformation 99
The load_dim_actor Transformation 101
The load_dim_film Transformation 102
The load_fact_rental Transformation 107
Summary 109
Part II ETL 111
Chapter 5 ETL Subsystems 113
Introduction to the 34 Subsystems 114
Extraction 114
Subsystems 1-3: Data Profiling, Change Data Capture, and
Extraction 115
Cleaning and Conforming Data 116
Subsystem 4: Data Cleaning and Quality Screen
Handler System 116
Subsystem 5: Error Event Handler 117
Subsystem 6: Audit Dimension Assembler 117
Subsystem 7: Deduplication System 117
Subsystem 8: Data Conformer 118
Data Delivery 118
Subsystem 9: Slowly Changing Dimension Processor 118
Subsystem 10: Surrogate Key Creation System 119
Subsystem 11: Hierarchy Dimension Builder 119
Subsystem 12: Special Dimension Builder 120
Subsystem 13: Fact Table Loader 121
Subsystem 14: Surrogate Key Pipeline 121
Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121
Subsystem 16: Late-Arriving Data Handler 122
Subsystem 17: Dimension Manager System 122
Subsystem 18: Fact Table Provider System 122
Subsystem 19: Aggregate Builder 123
Subsystem 20: Multidimensional (OLAP) Cube Builder 123
Subsystem 21: Data Integration Manager 123
Managing the ETL Environment 123
Summary 126
Chapter 6 Data Extraction 127
Kettle Data Extraction Overview 128
File-Based Extraction 128
Working with Text Files 128
Working with XML files 133
Special File Types 134
Database-Based Extraction 134
Web-Based Extraction 137
Text-Based Web Extraction 137
HTTP Client 137
Using SOAP 138
Stream-Based and Real-Time Extraction 138
Working with ERP and CRM Systems 138
ERP Challenges 139
Kettle ERP Plugins 140
Working with SAP Data 140
ERP and CDC Issues 146
Data Profiling 146
Using [...] DataCleaner 147
Adding Profile Tasks 149
Adding Database Connections 149
Doing an Initial Profile 151
Working with Regular Expressions 151
Profiling and Exploring Results 152
Validating and Comparing Data 153
Using a Dictionary for Column Dependency Checks 153
Alternative Solutions 154
Text Profiling with Kettle 154
CDC: Change Data Capture 154
Source Data-Based CDC 155
Trigger-Based CDC 157
Snapshot-Based CDC 158
Log-Based CDC 162
Which CDC Alternative Should You Choose? 163
Delivering Data 164
Summary 164
Chapter 7 Cleansing and Conforming 167
Data Cleansing 168
Data-Cleansing Steps 169
Using Reference Tables 172
Conforming Data Using Lookup Tables 172
Conforming Data Using Reference Tables 175
Data Validation 179
Applying Validation Rules 180
Validating Dependency Constraints 183
Error Handling 183
Handling Process Errors 184
Transformation Errors 186
Handling Data (Validation) Errors 187
Auditing Data and Process Quality 191
Deduplicating Data 192
Handling Exact Duplicates 193
The Problem of Non-Exact Duplicates 194
Building Deduplication Transforms 195
Step 1: Fuzzy Match 197
Step 2: Select Suspects 198
Step 3: Lookup Validation Value 198
Step 4: Filter Duplicates 199
Scripting 200
Formula 201
JavaScript 202
User-Defined Java Expressions 202
Regular Expressions 203
Summary 205
Chapter 8 Handling Dimension Tables 207
Managing Keys 208
Managing Business Keys 209
Keys in the Source System 209
Keys in the Data Warehouse 209
Business Keys 209
Storing Business Keys 210
Looking Up Keys with Kettle 210
Generating Surrogate Keys 210
The "Add sequence" Step 211
Working with auto_increment or IDENTITY Columns 217
Keys for Slowly Changing Dimensions 217
Loading Dimension Tables 218
Snowflaked Dimension Tables 218
Top-Down Level-Wise Loading 219
Sakila Snowflake Example 219
Sample Transformation 221
Database Lookup Configuration 222
Sample Job 225
Star Schema Dimension Tables 226
Denormalization 226
Denormalizing to 1NF with the "Database lookup" Step 226
Change Data Capture 227
Slowly Changing Dimensions 228
Types of Slowly Changing Dimensions 228
Type 1 Slowly Changing Dimensions 229
The Insert / Update Step 229
Type 2 Slowly Changing Dimensions 232
The "Dimension lookup / update" Step 232
Other Types of Slowly Changing Dimensions 237
Type 3 Slowly Changing Dimensions 237
Hybrid Slowly Changing Dimensions 238
More Dimensions 239
Generated Dimensions 239
Date and Time Dimensions 239
Generated Mini-Dimensions 239
Junk Dimensions 241
Recursive Hierarchies 242
Summary 243
Chapter 9 Loading Fact Tables 245
Loading in Bulk 246
STDIN and FIFO 247
Kettle Bulk Loaders 248
MySQL Bulk Loading 249
LucidDB Bulk Loader 249
Oracle Bulk Loader 249
PostgreSQL Bulk Loader 250
Table Output Step 250
General Bulk Load Considerations 250
Dimension Lookups 251
Maintaining Referential Integrity 251
The Surrogate Key Pipeline 252
Using In-Memory Lookups 253
Stream Lookups 253
Late-Arriving Data 255
Late-Arriving Facts 256
Late-Arriving Dimensions 256
Fact Table Handling 260
Periodic and Accumulating Snapshots 260
Introducing State-Oriented Fact Tables 261
Loading Periodic Snapshots 263
Loading Accumulating Snapshots...
Erscheinungsjahr: | 2010 |
---|---|
Genre: | Informatik |
Rubrik: | Naturwissenschaften & Technik |
Medium: | Taschenbuch |
Inhalt: | 720 S. |
ISBN-13: | 9780470635179 |
ISBN-10: | 0470635177 |
Sprache: | Englisch |
Herstellernummer: | 14563517000 |
Einband: | Kartoniert / Broschiert |
Autor: |
Casters, Matt
Bouman, Roland Dongen, Jos Van |
Hersteller: |
Wiley
John Wiley & Sons |
Maße: | 235 x 191 x 39 mm |
Von/Mit: | Matt Casters (u. a.) |
Erscheinungsdatum: | 28.09.2010 |
Gewicht: | 1,323 kg |