資料指標1

一、資料管理(Data Management)

  1. Data Profiling
  2. Data Standardization
  3. Geocoding
  4. Matching or linking
  5. Monitoring
  6. Batch and Real time

二、Data Profiling

確認資料的定義是什麼。

  1. Data Domain, Dictionary, Glossary
    1. Field Name
    2. Data Type
    3. Occurrence of Null

確認資料裡面有沒有NULL值,或能不能包含NULL值。EX性別、學號。或者,某個欄位的NULL值比例太高,就要評估這個欄位是否有存在的必要性。

  • 練習:查詢[SalesOrderHeader].[CreditCardID]欄位,空值與非空值的資料筆數。

SELECT SUM(CASE WHEN CreditCardID IS NULL THEN 1 ELSE 0 END) AS NULL_COUNT, SUM(CASE WHEN CreditCardID IS NULL THEN 0 ELSE 1 END) AS NOTNULL_COUNT

FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]

D. Indexing

索引會影響資料處理的速度(效能)。

E. String Pattern

資料裡面的文字描述格式是否一致。EX地址。

F. Constraint, Validation rules

驗證規則

G. Description

  • 練習:用變數儲存語法。查詢salesorderheader資料表的所有資料。

declare @cmd1 nvarchar(100)

set @cmd1 = ‘select * from ‘

set @cmd1 = @cmd1 + ‘sales.salesorderheader’

–print @cmd1

exec(@cmd1)

  1. Relationship Discovery

資料表的關係,去發掘資料表之間有什麼相關性。

A. ER-Diagram/ER-Model

B. Relationship

C. Cardinality

D. Referential Test

3. Data Quality Summary

Minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum

(1) 查詢SalesOrderHeader資料表中,銷售金額的相關計算:

建立檢視表,選取[SalesOrderHeader]資料表,先使用COUNT計算全部資料筆數、SalesOrderID筆數、CustomerID筆數,再計算SubTotal的合計、最小值、最大值、平均、變異數

SELECT          COUNT(*) AS TOTAL_COUNT, COUNT(DISTINCT SalesOrderID) AS SALES_COUNT, COUNT(DISTINCT CustomerID) AS CUSTOMER_COUNT, SUM(SubTotal) AS 銷售金額_合計, MIN(SubTotal) AS 銷售金額_最小, MAX(SubTotal) AS 銷售金額_最大, AVG(SubTotal) AS 銷售金額_平均, STDEV(SubTotal) AS 銷售金額_變異

FROM              Sales.SalesOrderHeader

(2) 查詢檢視SalesOrderHeader資料表中,多個欄位的相關計算:

建立檢視表,選取[SalesOrderHeader]資料表,用上一個練習的語法修改,並在最前面補上一個欄位(N’欲查詢欄位名稱’ AS ‘欄位’)。

SELECT          N’銷售金額’ AS ‘欄位’, SUM(SubTotal) AS 合計, MIN(SubTotal) AS 最小, MAX(SubTotal) AS 最大, AVG(SubTotal) AS 平均, STDEV(SubTotal) AS 變異

FROM              Sales.SalesOrderHeader

UNION

SELECT          N’稅金’ AS ‘欄位’, SUM(TaxAmt) AS 合計, MIN(TaxAmt) AS 最小, MAX(TaxAmt) AS 最大, AVG(TaxAmt) AS 平均, STDEV(TaxAmt) AS 變異

FROM              Sales.SalesOrderHeader

  1. Error Discovering

illegal values, misspellings, missing values, varying value representation, and duplicates

三、SQL Server Information_Schema

  1. 建立查詢,查詢COLUMNS中,TABLE_SCHEMA、TABLE_NAME、COLUMN_NAME三個欄位,TABLE_NAME = ‘SALESORDERHEADER’,DATA_TYPE = ‘int’

(AdventureWork2012→檢視→系統檢視表→INFORMATION_SCHEMA.COLUMNS)

SELECT [TABLE_SCHEMA]

,[TABLE_NAME]

,[COLUMN_NAME]

FROM [AdventureWorks2012].[INFORMATION_SCHEMA].[COLUMNS]

WHERE TABLE_NAME = ‘SALESORDERHEADER’ AND DATA_TYPE = ‘int’

四、函數

declare @cmd1 nvarchar(100)

set @cmd1 = ‘select * from ‘

set @cmd1 = @cmd1 + ‘sales.salesorderheader’

print @cmd1

exec(@cmd1)

五、資料指標(Cursor)

練習:查詢SALESORDERHEADER資料表中,資料類型為整數的欄位有哪些,並計算每個欄位的最大最小值、平均、變異。

–宣告Cursor for 迴圈

Declare mydata Cursor for

–for迴圈裡面的資料

SELECT [COLUMN_NAME]

FROM [AdventureWorks2012].[INFORMATION_SCHEMA].[COLUMNS]

WHERE TABLE_NAME = ‘SALESORDERHEADER’ AND DATA_TYPE = ‘int’

–宣告變數

DECLARE @cmd1 nvarchar(2000)

DECLARE @colname nvarchar(100)

–開始run cursor

OPEN mydata

FETCH NEXT FROM mydata INTO @colname  –將第一筆資料填入變數

–宣告變數、設定變數值

DECLARE @flag int

SET @flag = 0

SET @cmd1 = ”

WHILE @@FETCH_STATUS = 0  –檢查是否有讀取到資料;WHILE用來處理迴圈,當為TRUE時則進入迴圈執行

BEGIN  –BEGIN…END表示一個區塊,通常搭配IF…ELSE或WHILE一起使用

IF @flag=1

BEGIN

SET @cmd1 = @cmd1 + ‘ union ‘;

END

SET @cmd1= @cmd1 + ‘Select ”’ + @colname + ”’ AS ‘ + N’欄位名稱’ + ‘, MAX(‘ + @colname + ‘) AS ‘ + N’最大值’ + ‘, MIN(‘ + @colname + ‘) AS ‘ + N’最小值’ + ‘, AVG(‘ + @colname + ‘) AS ‘ + N’平均’ + ‘, STDEV(‘ + @colname + ‘) AS ‘ + N’變異’ + ‘ from Sales.SalesOrderHeader’

SET @flag = 1

FETCH NEXT FROM mydata INTO @colname  –將下一筆資料填入變數

END

print (@cmd1)

exec (@cmd1)

CLOSE mydata  –關閉cursor與參數的關聯

DEALLOCATE mydata  –將cursor物件從記憶體移除

  • 資料庫資料表名稱是否區分大小寫

資料庫右鍵→屬性→「一般」頁籤→定序

CI:不區分大小寫,CS:區分大小寫

Leave a Reply