본문 바로가기
Data-driven Methodology/Data Engineering

Extract / Transform / Load (ETL)

by goatlab 2022. 5. 11.
728x90
반응형
SMALL

Extract / Transform / Load (ETL)

 

 

여러 파일 형식에서 데이터를 추출하고, API 및 웹 스크래핑을 통해 데이터를 수집하고, 마지막으로 수집된 데이터를 로드할 수 있는 형식으로 변환하는 등의 특정 작업을 수행한다. 그런 다음 동료가 작업을 검토하고 평가할 수 있도록 Python 노트북과 스크린샷을 제출한다.

 

추출 (Extract)

 

import glob

list_csv=glob.glob('*.csv')
list_json=glob.glob('*.json')
# extract csv
def extract_from_csv(file_to_process):
	dataframe = pd.read_csv(file_to_process)
	return dataframe

df = extract_from_csv('test.csv')
# extract json
def extract_from_json(file_to_process):
	dataframe = pd.read_csv(file_to_process, lines=True)
	return dataframe

df = extract_from_json('test.json')
# Extract Function
def extract():
	# create an empty data frame to hold extracted data
	extracted_data = pd.DataFrame(columns=['name', 'height', 'weight']

	# process all csv files
	for csvfile in glob.glob("*.csv"):
		extracted_data = extracted_data.append(extract_from_csv(csvfile),
		ignore_index=True)

	# process all json files
	for jsonfile in glob.glob("*.json"):
		extracted_data = extracted_data.append(extract_from_json(jsonfile),
		ignore_index=True)

	return extracted_data

empty data frame

 

변환 (Transform)

 

# Transform Function
def transform(data):
	# Convert height which is in inches to millimeter
	# Convert inches to meters and round off to two decimals (one inch is 0.0254 meters)
	data['height'] = round(data.height * 0.0254, 2)

	# Convert pounds to kilograms and round off to two decimals (one pound is 0.45359237 kilograms)
	data['weight'] = round(data.weight * 0.45359237, 2)

	return data

 

로드 (Load)

 

# Load Function
def load(targerfile, data_to_load):
	data_to_load.to_csv(targetfile)

targetfile = "transformed_data.csv"
load(targetfile, transformed_data)
# Logging Entries
from datetime import datetime

def log(message):
	timestamp_format = '%Y-%h-%d-%H:%M:%S'
	now = datatime.now()
	timestamp = now.strftime(timestamp_format)
	with open("logfile.txt", "a") as f:
		f.write(timestamp+','+message+'\n')
# Final Call
log("ETL Job Started")

log("Extract phase Started")
extracted_data = extract()
log("Extract phase Ended")

log("Transform Job Started")
transform_data = transform(extracted_data)
log("Transform Job Ended")

log("Load Job Started")
load(targetfile, transformed_data)
log("Load Job Ended")

log("ETL Job Ended")
728x90
반응형
LIST