Zum Hauptinhalt springen
Dekorationsartikel gehören nicht zum Leistungsumfang.
Pentaho Kettle Solutions
Building Open Source Etl Solutions with Pentaho Data Integration
Taschenbuch von Matt Casters (u. a.)
Sprache: Englisch

51,50 €*

inkl. MwSt.

Versandkostenfrei per Post / DHL

Aktuell nicht verfügbar

Kategorien:
Beschreibung
A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions--before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
* Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
* Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
* Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
* Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"

Get the most out of Pentaho Kettle and your data warehousing with this detailed guide--from simple single table data migration to complex multisystem clustered data integration tasks.
A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions--before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
* Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
* Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
* Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
* Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"

Get the most out of Pentaho Kettle and your data warehousing with this detailed guide--from simple single table data migration to complex multisystem clustered data integration tasks.
Über den Autor

Matt Casters is Founder of Kettle and works as Chief Data Integration at Pentaho, where he leads Kettle software development. Roland Bouman is an application developer focusing on open source web technology, databases, and business intelligence. Jos van Dongen is an independent business intelligence consultant and well-known author, analyst, and presenter.

Inhaltsverzeichnis

Introduction xxxi

Part I Getting Started 1

Chapter 1 ETL Primer 3

OLTP versus Data Warehousing 3

What Is ETL? 5

The Evolution of ETL Solutions 5

ETL Building Blocks 7

ETL, ELT, and EII 8

ELT 9

EII: Virtual Data Integration 10

Data Integration Challenges 11

Methodology: Agile BI 12

ETL Design 14

Data Acquisition 14

Beware of Spreadsheets 15

Design for Failure 15

Change Data Capture 16

Data Quality 16

Data Profiling 16

Data Validation 17

ETL Tool Requirements 17

Connectivity 17

Platform Independence 18

Scalability 18

Design Flexibility 19

Reuse 19

Extensibility 19

Data Transformations 20

Testing and Debugging 21

Lineage and Impact Analysis 21

Logging and Auditing 22

Summary 22

Chapter 2 Kettle Concepts 23

Design Principles 23

The Building Blocks of Kettle Design 25

Transformations 25

Steps 26

Transformation Hops 26

Parallelism 27

Rows of Data 27

Data Conversion 29

Jobs 30

Job Entries 31

Job Hops 31

Multiple Paths and Backtracking 32

Parallel Execution 33

Job Entry Results 34

Transformation or Job Metadata 36

Database Connections 37

Special Options 38

The Power of the Relational Database 39

Connections and Transactions 39

Database Clustering 40

Tools and Utilities 41

Repositories 41

Virtual File Systems 42

Parameters and Variables 43

Defining Variables 43

Named Parameters 44

Using Variables 44

Visual Programming 45

Getting Started 46

Creating New Steps 47

Putting It All Together 49

Summary 51

Chapter 3 Installation and Configuration 53

Kettle Software Overview 53

Integrated Development Environment: Spoon 55

Command-Line Launchers: Kitchen and Pan 57

Job Server: Carte 57

[...] and [...] 58

Installation 58

Java Environment 58

Installing Java Manually 58

Using Your Linux Package Management System 59

Installing Kettle 59

Versions and Releases 59

Archive Names and Formats 60

Downloading and Uncompressing 60

Running Kettle Programs 61

Creating a Shortcut Icon or Launcher for Spoon 62

Configuration 63

Configuration Files and the .kettle Directory 63

The Kettle Shell Scripts 69

General Structure of the Startup Scripts 70

Adding an Entry to the Classpath 70

Changing the Maximum Heap Size 71

Managing JDBC Drivers 72

Summary 72

Chapter 4 An Example ETL Solution-Sakila 73

Sakila 73

The Sakila Sample Database 74

DVD Rental Business Process 74

Sakila Database Schema Diagram 75

Sakila Database Subject Areas 75

General Design Considerations 77

Installing the Sakila Sample Database 77

The Rental Star Schema 78

Rental Star Schema Diagram 78

Rental Fact Table 79

Dimension Tables 79

Keys and Change Data Capture 80

Installing the Rental Star Schema 81

Prerequisites and Some Basic Spoon Skills 81

Setting Up the ETL Solution 82

Creating Database Accounts 82

Working with Spoon 82

Opening Transformation and Job Files 82

Opening the Step's Configuration Dialog 83

Examining Streams 83

Running Jobs and Transformations 83

The Sample ETL Solution 84

Static, Generated Dimensions 84

Loading the dim_date Dimension Table 84

Loading the dim_time Dimension Table 86

Recurring Load 87

The load_rentals Job 88

The load_dim_staff Transformation 91

Database Connections 91

The load_dim_customer Transformation 95

The load_dim_store Transformation 98

The fetch_address Subtransformation 99

The load_dim_actor Transformation 101

The load_dim_film Transformation 102

The load_fact_rental Transformation 107

Summary 109

Part II ETL 111

Chapter 5 ETL Subsystems 113

Introduction to the 34 Subsystems 114

Extraction 114

Subsystems 1-3: Data Profiling, Change Data Capture, and

Extraction 115

Cleaning and Conforming Data 116

Subsystem 4: Data Cleaning and Quality Screen

Handler System 116

Subsystem 5: Error Event Handler 117

Subsystem 6: Audit Dimension Assembler 117

Subsystem 7: Deduplication System 117

Subsystem 8: Data Conformer 118

Data Delivery 118

Subsystem 9: Slowly Changing Dimension Processor 118

Subsystem 10: Surrogate Key Creation System 119

Subsystem 11: Hierarchy Dimension Builder 119

Subsystem 12: Special Dimension Builder 120

Subsystem 13: Fact Table Loader 121

Subsystem 14: Surrogate Key Pipeline 121

Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

Subsystem 16: Late-Arriving Data Handler 122

Subsystem 17: Dimension Manager System 122

Subsystem 18: Fact Table Provider System 122

Subsystem 19: Aggregate Builder 123

Subsystem 20: Multidimensional (OLAP) Cube Builder 123

Subsystem 21: Data Integration Manager 123

Managing the ETL Environment 123

Summary 126

Chapter 6 Data Extraction 127

Kettle Data Extraction Overview 128

File-Based Extraction 128

Working with Text Files 128

Working with XML files 133

Special File Types 134

Database-Based Extraction 134

Web-Based Extraction 137

Text-Based Web Extraction 137

HTTP Client 137

Using SOAP 138

Stream-Based and Real-Time Extraction 138

Working with ERP and CRM Systems 138

ERP Challenges 139

Kettle ERP Plugins 140

Working with SAP Data 140

ERP and CDC Issues 146

Data Profiling 146

Using [...] DataCleaner 147

Adding Profile Tasks 149

Adding Database Connections 149

Doing an Initial Profile 151

Working with Regular Expressions 151

Profiling and Exploring Results 152

Validating and Comparing Data 153

Using a Dictionary for Column Dependency Checks 153

Alternative Solutions 154

Text Profiling with Kettle 154

CDC: Change Data Capture 154

Source Data-Based CDC 155

Trigger-Based CDC 157

Snapshot-Based CDC 158

Log-Based CDC 162

Which CDC Alternative Should You Choose? 163

Delivering Data 164

Summary 164

Chapter 7 Cleansing and Conforming 167

Data Cleansing 168

Data-Cleansing Steps 169

Using Reference Tables 172

Conforming Data Using Lookup Tables 172

Conforming Data Using Reference Tables 175

Data Validation 179

Applying Validation Rules 180

Validating Dependency Constraints 183

Error Handling 183

Handling Process Errors 184

Transformation Errors 186

Handling Data (Validation) Errors 187

Auditing Data and Process Quality 191

Deduplicating Data 192

Handling Exact Duplicates 193

The Problem of Non-Exact Duplicates 194

Building Deduplication Transforms 195

Step 1: Fuzzy Match 197

Step 2: Select Suspects 198

Step 3: Lookup Validation Value 198

Step 4: Filter Duplicates 199

Scripting 200

Formula 201

JavaScript 202

User-Defined Java Expressions 202

Regular Expressions 203

Summary 205

Chapter 8 Handling Dimension Tables 207

Managing Keys 208

Managing Business Keys 209

Keys in the Source System 209

Keys in the Data Warehouse 209

Business Keys 209

Storing Business Keys 210

Looking Up Keys with Kettle 210

Generating Surrogate Keys 210

The "Add sequence" Step 211

Working with auto_increment or IDENTITY Columns 217

Keys for Slowly Changing Dimensions 217

Loading Dimension Tables 218

Snowflaked Dimension Tables 218

Top-Down Level-Wise Loading 219

Sakila Snowflake Example 219

Sample Transformation 221

Database Lookup Configuration 222

Sample Job 225

Star Schema Dimension Tables 226

Denormalization 226

Denormalizing to 1NF with the "Database lookup" Step 226

Change Data Capture 227

Slowly Changing Dimensions 228

Types of Slowly Changing Dimensions 228

Type 1 Slowly Changing Dimensions 229

The Insert / Update Step 229

Type 2 Slowly Changing Dimensions 232

The "Dimension lookup / update" Step 232

Other Types of Slowly Changing Dimensions 237

Type 3 Slowly Changing Dimensions 237

Hybrid Slowly Changing Dimensions 238

More Dimensions 239

Generated Dimensions 239

Date and Time Dimensions 239

Generated Mini-Dimensions 239

Junk Dimensions 241

Recursive Hierarchies 242

Summary 243

Chapter 9 Loading Fact Tables 245

Loading in Bulk 246

STDIN and FIFO 247

Kettle Bulk Loaders 248

MySQL Bulk Loading 249

LucidDB Bulk Loader 249

Oracle Bulk Loader 249

PostgreSQL Bulk Loader 250

Table Output Step 250

General Bulk Load Considerations 250

Dimension Lookups 251

Maintaining Referential Integrity 251

The Surrogate Key Pipeline 252

Using In-Memory Lookups 253

Stream Lookups 253

Late-Arriving Data 255

Late-Arriving Facts 256

Late-Arriving Dimensions 256

Fact Table Handling 260

Periodic and Accumulating Snapshots 260

Introducing State-Oriented Fact Tables 261

Loading Periodic Snapshots 263

Loading Accumulating Snapshots...

Details
Erscheinungsjahr: 2010
Genre: Informatik
Rubrik: Naturwissenschaften & Technik
Medium: Taschenbuch
Inhalt: 720 S.
ISBN-13: 9780470635179
ISBN-10: 0470635177
Sprache: Englisch
Herstellernummer: 14563517000
Einband: Kartoniert / Broschiert
Autor: Casters, Matt
Bouman, Roland
Dongen, Jos Van
Hersteller: Wiley
John Wiley & Sons
Maße: 235 x 191 x 39 mm
Von/Mit: Matt Casters (u. a.)
Erscheinungsdatum: 28.09.2010
Gewicht: 1,323 kg
Artikel-ID: 101166652
Über den Autor

Matt Casters is Founder of Kettle and works as Chief Data Integration at Pentaho, where he leads Kettle software development. Roland Bouman is an application developer focusing on open source web technology, databases, and business intelligence. Jos van Dongen is an independent business intelligence consultant and well-known author, analyst, and presenter.

Inhaltsverzeichnis

Introduction xxxi

Part I Getting Started 1

Chapter 1 ETL Primer 3

OLTP versus Data Warehousing 3

What Is ETL? 5

The Evolution of ETL Solutions 5

ETL Building Blocks 7

ETL, ELT, and EII 8

ELT 9

EII: Virtual Data Integration 10

Data Integration Challenges 11

Methodology: Agile BI 12

ETL Design 14

Data Acquisition 14

Beware of Spreadsheets 15

Design for Failure 15

Change Data Capture 16

Data Quality 16

Data Profiling 16

Data Validation 17

ETL Tool Requirements 17

Connectivity 17

Platform Independence 18

Scalability 18

Design Flexibility 19

Reuse 19

Extensibility 19

Data Transformations 20

Testing and Debugging 21

Lineage and Impact Analysis 21

Logging and Auditing 22

Summary 22

Chapter 2 Kettle Concepts 23

Design Principles 23

The Building Blocks of Kettle Design 25

Transformations 25

Steps 26

Transformation Hops 26

Parallelism 27

Rows of Data 27

Data Conversion 29

Jobs 30

Job Entries 31

Job Hops 31

Multiple Paths and Backtracking 32

Parallel Execution 33

Job Entry Results 34

Transformation or Job Metadata 36

Database Connections 37

Special Options 38

The Power of the Relational Database 39

Connections and Transactions 39

Database Clustering 40

Tools and Utilities 41

Repositories 41

Virtual File Systems 42

Parameters and Variables 43

Defining Variables 43

Named Parameters 44

Using Variables 44

Visual Programming 45

Getting Started 46

Creating New Steps 47

Putting It All Together 49

Summary 51

Chapter 3 Installation and Configuration 53

Kettle Software Overview 53

Integrated Development Environment: Spoon 55

Command-Line Launchers: Kitchen and Pan 57

Job Server: Carte 57

[...] and [...] 58

Installation 58

Java Environment 58

Installing Java Manually 58

Using Your Linux Package Management System 59

Installing Kettle 59

Versions and Releases 59

Archive Names and Formats 60

Downloading and Uncompressing 60

Running Kettle Programs 61

Creating a Shortcut Icon or Launcher for Spoon 62

Configuration 63

Configuration Files and the .kettle Directory 63

The Kettle Shell Scripts 69

General Structure of the Startup Scripts 70

Adding an Entry to the Classpath 70

Changing the Maximum Heap Size 71

Managing JDBC Drivers 72

Summary 72

Chapter 4 An Example ETL Solution-Sakila 73

Sakila 73

The Sakila Sample Database 74

DVD Rental Business Process 74

Sakila Database Schema Diagram 75

Sakila Database Subject Areas 75

General Design Considerations 77

Installing the Sakila Sample Database 77

The Rental Star Schema 78

Rental Star Schema Diagram 78

Rental Fact Table 79

Dimension Tables 79

Keys and Change Data Capture 80

Installing the Rental Star Schema 81

Prerequisites and Some Basic Spoon Skills 81

Setting Up the ETL Solution 82

Creating Database Accounts 82

Working with Spoon 82

Opening Transformation and Job Files 82

Opening the Step's Configuration Dialog 83

Examining Streams 83

Running Jobs and Transformations 83

The Sample ETL Solution 84

Static, Generated Dimensions 84

Loading the dim_date Dimension Table 84

Loading the dim_time Dimension Table 86

Recurring Load 87

The load_rentals Job 88

The load_dim_staff Transformation 91

Database Connections 91

The load_dim_customer Transformation 95

The load_dim_store Transformation 98

The fetch_address Subtransformation 99

The load_dim_actor Transformation 101

The load_dim_film Transformation 102

The load_fact_rental Transformation 107

Summary 109

Part II ETL 111

Chapter 5 ETL Subsystems 113

Introduction to the 34 Subsystems 114

Extraction 114

Subsystems 1-3: Data Profiling, Change Data Capture, and

Extraction 115

Cleaning and Conforming Data 116

Subsystem 4: Data Cleaning and Quality Screen

Handler System 116

Subsystem 5: Error Event Handler 117

Subsystem 6: Audit Dimension Assembler 117

Subsystem 7: Deduplication System 117

Subsystem 8: Data Conformer 118

Data Delivery 118

Subsystem 9: Slowly Changing Dimension Processor 118

Subsystem 10: Surrogate Key Creation System 119

Subsystem 11: Hierarchy Dimension Builder 119

Subsystem 12: Special Dimension Builder 120

Subsystem 13: Fact Table Loader 121

Subsystem 14: Surrogate Key Pipeline 121

Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

Subsystem 16: Late-Arriving Data Handler 122

Subsystem 17: Dimension Manager System 122

Subsystem 18: Fact Table Provider System 122

Subsystem 19: Aggregate Builder 123

Subsystem 20: Multidimensional (OLAP) Cube Builder 123

Subsystem 21: Data Integration Manager 123

Managing the ETL Environment 123

Summary 126

Chapter 6 Data Extraction 127

Kettle Data Extraction Overview 128

File-Based Extraction 128

Working with Text Files 128

Working with XML files 133

Special File Types 134

Database-Based Extraction 134

Web-Based Extraction 137

Text-Based Web Extraction 137

HTTP Client 137

Using SOAP 138

Stream-Based and Real-Time Extraction 138

Working with ERP and CRM Systems 138

ERP Challenges 139

Kettle ERP Plugins 140

Working with SAP Data 140

ERP and CDC Issues 146

Data Profiling 146

Using [...] DataCleaner 147

Adding Profile Tasks 149

Adding Database Connections 149

Doing an Initial Profile 151

Working with Regular Expressions 151

Profiling and Exploring Results 152

Validating and Comparing Data 153

Using a Dictionary for Column Dependency Checks 153

Alternative Solutions 154

Text Profiling with Kettle 154

CDC: Change Data Capture 154

Source Data-Based CDC 155

Trigger-Based CDC 157

Snapshot-Based CDC 158

Log-Based CDC 162

Which CDC Alternative Should You Choose? 163

Delivering Data 164

Summary 164

Chapter 7 Cleansing and Conforming 167

Data Cleansing 168

Data-Cleansing Steps 169

Using Reference Tables 172

Conforming Data Using Lookup Tables 172

Conforming Data Using Reference Tables 175

Data Validation 179

Applying Validation Rules 180

Validating Dependency Constraints 183

Error Handling 183

Handling Process Errors 184

Transformation Errors 186

Handling Data (Validation) Errors 187

Auditing Data and Process Quality 191

Deduplicating Data 192

Handling Exact Duplicates 193

The Problem of Non-Exact Duplicates 194

Building Deduplication Transforms 195

Step 1: Fuzzy Match 197

Step 2: Select Suspects 198

Step 3: Lookup Validation Value 198

Step 4: Filter Duplicates 199

Scripting 200

Formula 201

JavaScript 202

User-Defined Java Expressions 202

Regular Expressions 203

Summary 205

Chapter 8 Handling Dimension Tables 207

Managing Keys 208

Managing Business Keys 209

Keys in the Source System 209

Keys in the Data Warehouse 209

Business Keys 209

Storing Business Keys 210

Looking Up Keys with Kettle 210

Generating Surrogate Keys 210

The "Add sequence" Step 211

Working with auto_increment or IDENTITY Columns 217

Keys for Slowly Changing Dimensions 217

Loading Dimension Tables 218

Snowflaked Dimension Tables 218

Top-Down Level-Wise Loading 219

Sakila Snowflake Example 219

Sample Transformation 221

Database Lookup Configuration 222

Sample Job 225

Star Schema Dimension Tables 226

Denormalization 226

Denormalizing to 1NF with the "Database lookup" Step 226

Change Data Capture 227

Slowly Changing Dimensions 228

Types of Slowly Changing Dimensions 228

Type 1 Slowly Changing Dimensions 229

The Insert / Update Step 229

Type 2 Slowly Changing Dimensions 232

The "Dimension lookup / update" Step 232

Other Types of Slowly Changing Dimensions 237

Type 3 Slowly Changing Dimensions 237

Hybrid Slowly Changing Dimensions 238

More Dimensions 239

Generated Dimensions 239

Date and Time Dimensions 239

Generated Mini-Dimensions 239

Junk Dimensions 241

Recursive Hierarchies 242

Summary 243

Chapter 9 Loading Fact Tables 245

Loading in Bulk 246

STDIN and FIFO 247

Kettle Bulk Loaders 248

MySQL Bulk Loading 249

LucidDB Bulk Loader 249

Oracle Bulk Loader 249

PostgreSQL Bulk Loader 250

Table Output Step 250

General Bulk Load Considerations 250

Dimension Lookups 251

Maintaining Referential Integrity 251

The Surrogate Key Pipeline 252

Using In-Memory Lookups 253

Stream Lookups 253

Late-Arriving Data 255

Late-Arriving Facts 256

Late-Arriving Dimensions 256

Fact Table Handling 260

Periodic and Accumulating Snapshots 260

Introducing State-Oriented Fact Tables 261

Loading Periodic Snapshots 263

Loading Accumulating Snapshots...

Details
Erscheinungsjahr: 2010
Genre: Informatik
Rubrik: Naturwissenschaften & Technik
Medium: Taschenbuch
Inhalt: 720 S.
ISBN-13: 9780470635179
ISBN-10: 0470635177
Sprache: Englisch
Herstellernummer: 14563517000
Einband: Kartoniert / Broschiert
Autor: Casters, Matt
Bouman, Roland
Dongen, Jos Van
Hersteller: Wiley
John Wiley & Sons
Maße: 235 x 191 x 39 mm
Von/Mit: Matt Casters (u. a.)
Erscheinungsdatum: 28.09.2010
Gewicht: 1,323 kg
Artikel-ID: 101166652
Warnhinweis

Ähnliche Produkte

Ähnliche Produkte