前言
本实验以DataLeap on LAS为例,实际操作火山引擎数据产品,完成数据仓库的构建。
关于实验
- 预计部署时间:50分钟
- 级别:初级
- 相关产品:大数据开发套件、湖仓一体分析服务LAS
- 受众: 通用
环境说明
- 已购买DataLeap产品
- 已创建湖仓一体LAS队列
- 子账户具备DataLeap相关权限(参考:https://www.volcengine.com/docs/6260/65408)
实验说明
步骤1:创建项目
步骤2:计算资源组设置
本案例以湖仓一体Las为例,这里选择已创建的湖仓一体服务 完成上述步骤之后,创建好的DataLeap项目如下:
本Demo中以湖仓一体LAS的样例数据为实验数据 (TPC-DS中的样例表:https://www.volcengine.com/docs/6492/81953)
步骤3:导入样例数据
导入数据四张样例表: 样例中的四张表分别代表:
- [事实表] Store_Sales: 销售记录表。
- [维度表] Customers: 客户信息表。
- [维度表] Stores: 商店信息表。
- [维度表] Date_Dim: 时间信息表。
基于上述表数据,我们的数据分析需求如下: 1)“查看最近三天商店销售额情况(未促销)TOP3” 2)“查看最近三天消费最多的用户与金额TOP3” 3)“获取商店地域分布情况”
经典数据仓库按照大类分为基础数据层、应用数据层。
本样例中,我们的数据仓库建设思路是:
- ODS(从生产系统采集原始数据,并将原始数据集成冗余宽表)
- DWD(对ODS冗余表数据进行轻度过滤处理)
- DWM (基于DWD表与业务需求,轻度聚合最近三天的数据)
- APP (基于DWD或DWM,输出具体报表信息)
在“数据地图”中创建数据仓库中要使用到的表: 本案例中库信息为:demo_tpc_ds_2022_11_07_59(请结合具体情况修改)
步骤4: 数据仓库分层建表
ODS(数据聚合宽表)
CREATE TABLE demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df (
id bigint comment '主键',
ss_sold_date_sk bigint comment '销售日期',
ss_sold_time_sk bigint comment '销售时间',
ss_item_sk bigint comment '销售物品',
ss_customer_sk bigint comment '客户信息',
ss_cdemo_sk bigint,
ss_hdemo_sk bigint,
ss_addr_sk bigint comment '地址信息',
ss_store_sk bigint comment '商店信息',
ss_promo_sk bigint comment '促销信息',
ss_ticket_number bigint comment '订单号',
ss_quantity int comment '数量',
ss_wholesale_cost double comment '整个销售总额',
ss_list_price double comment '列表价格',
ss_sales_price double comment '销售价格',
ss_ext_discount_amt double comment '外部折扣相关信息',
ss_ext_sales_price double comment '外部销售信息',
ss_ext_wholesale_cost double comment '外部整个销售总额',
ss_ext_list_price double comment '外部列表价格',
ss_ext_tax double comment '外部税相关',
ss_coupon_amt double comment '打折券',
ss_net_paid double comment '消费额',
ss_net_paid_inc_tax double comment '消费税',
ss_net_profit double comment '盈利',
c_customer_sk bigint comment '客户信息',
c_customer_id bigint comment '客户id',
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint comment '地址信息',
c_first_shipto_date_sk bigint comment '运输时间',
c_first_sales_date_sk bigint comment '销售时间',
c_salutation string comment '称呼',
c_first_name string comment '名',
c_last_name string comment '姓',
c_preferred_cust_flag string comment '偏好标记',
c_birth_day int comment '生日',
c_birth_month int comment '生日月份',
c_birth_year int comment '出生年',
c_birth_country string comment '出生国家',
c_login string comment '登录信息',
c_email_address string comment '邮件地址',
c_last_review_date_sk bigint comment '上次评价日期',
s_store_sk bigint comment '商店',
s_store_id string comment '商店id',
s_rec_start_date date comment '商店开始时间',
s_rec_end_date date comment '商店结束时间',
s_closed_date_sk bigint comment '商店关闭时间',
s_store_name string comment '商店名称',
s_number_employees int comment '雇员数量',
s_floor_space int,
s_hours string comment '小时',
s_manager string comment '经理',
s_market_id int comment '市场信息',
s_geography_class string comment '地理位置信息',
s_market_desc string comment '市场描述',
s_market_manager string comment '市场经理',
s_division_id int,
s_division_name string,
s_company_id int comment '公司id',
s_company_name string comment '公司名称',
s_street_number string comment '街道编号',
s_street_name string comment '街道名称',
s_street_type string comment '街道类型',
s_suite_number string comment '住宿编号',
s_city string comment '城市',
s_county string comment '国家',
s_state string comment '洲',
s_zip string comment '邮政编号',
s_gmt_offset double,
s_tax_precentage double comment '税收百分比',
d_date_sk bigint comment '日期',
d_date_id string comment '日期id',
d_date date comment '具体date',
d_month_seq int comment '第几月份',
d_week_seq int comment '第几周',
d_quarter_seq int comment '第几季度',
d_year int comment '年份',
d_dow int,
d_moy int comment '月份相关信息',
d_dom int,
d_qoy int,
d_fy_year int comment '年份相关信息',
d_fy_quarter_seq int comment '季度序列',
d_fy_week_seq int comment '周数序列',
d_day_name string comment '显示的天数',
d_quarter_name string comment '显示的季度',
d_holiday string comment '节日',
d_weekend string comment '周末',
d_following_holiday string comment '后续节日信息',
d_first_dom int,
d_last_dom int,
d_same_day_ly int comment '是否同一天',
d_same_day_lq int comment '是否同一天',
d_current_day string comment '是否为当前的天数',
d_current_week string comment '是否为当前的周数',
d_current_month string comment '是否为当前的月份',
d_current_quarter string comment '是否为当前的季度',
d_current_year string comment '是否为当前的年份'
)
DWD(轻度过滤)
CREATE TABLE demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df(
id bigint comment '主键',
ss_sold_date_sk bigint comment '销售日期',
ss_sold_time_sk bigint comment '销售时间',
ss_item_sk bigint comment '销售物品',
ss_customer_sk bigint comment '客户信息',
ss_cdemo_sk bigint,
ss_hdemo_sk bigint,
ss_addr_sk bigint comment '地址信息',
ss_store_sk bigint comment '商店信息',
ss_promo_sk bigint comment '促销信息',
ss_ticket_number bigint comment '订单号',
ss_quantity int comment '数量',
ss_wholesale_cost double comment '整个销售总额',
ss_list_price double comment '列表价格',
ss_sales_price double comment '销售价格',
ss_ext_discount_amt double comment '外部折扣相关信息',
ss_ext_sales_price double comment '外部销售信息',
ss_ext_wholesale_cost double comment '外部整个销售总额',
ss_ext_list_price double comment '外部列表价格',
ss_ext_tax double comment '外部税相关',
ss_coupon_amt double comment '打折券',
ss_net_paid double comment '消费额',
ss_net_paid_inc_tax double comment '消费税',
ss_net_profit double comment '盈利',
c_customer_sk bigint comment '客户信息',
c_customer_id bigint comment '客户id',
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint comment '地址信息',
c_first_shipto_date_sk bigint comment '运输时间',
c_first_sales_date_sk bigint comment '销售时间',
c_salutation string comment '称呼',
c_first_name string comment '名',
c_last_name string comment '姓',
c_preferred_cust_flag string comment '偏好标记',
c_birth_day int comment '生日',
c_birth_month int comment '生日月份',
c_birth_year int comment '出生年',
c_birth_country string comment '出生国家',
c_login string comment '登录信息',
c_email_address string comment '邮件地址',
c_last_review_date_sk bigint comment '上次评价日期',
s_store_sk bigint comment '商店',
s_store_id string comment '商店id',
s_rec_start_date date comment '商店开始时间',
s_rec_end_date date comment '商店结束时间',
s_closed_date_sk bigint comment '商店关闭时间',
s_store_name string comment '商店名称',
s_number_employees int comment '雇员数量',
s_floor_space int,
s_hours string comment '小时',
s_manager string comment '经理',
s_market_id int comment '市场信息',
s_geography_class string comment '地理位置信息',
s_market_desc string comment '市场描述',
s_market_manager string comment '市场经理',
s_division_id int,
s_division_name string,
s_company_id int comment '公司id',
s_company_name string comment '公司名称',
s_street_number string comment '街道编号',
s_street_name string comment '街道名称',
s_street_type string comment '街道类型',
s_suite_number string comment '住宿编号',
s_city string comment '城市',
s_county string comment '国家',
s_state string comment '洲',
s_zip string comment '邮政编号',
s_gmt_offset double,
s_tax_precentage double comment '税收百分比',
d_date_sk bigint comment '日期',
d_date_id string comment '日期id',
d_date date comment '具体date',
d_month_seq int comment '第几月份',
d_week_seq int comment '第几周',
d_quarter_seq int comment '第几季度',
d_year int comment '年份',
d_dow int,
d_moy int comment '月份相关信息',
d_dom int,
d_qoy int,
d_fy_year int comment '年份相关信息',
d_fy_quarter_seq int comment '季度序列',
d_fy_week_seq int comment '周数序列',
d_day_name string comment '显示的天数',
d_quarter_name string comment '显示的季度',
d_holiday string comment '节日',
d_weekend string comment '周末',
d_following_holiday string comment '后续节日信息',
d_first_dom int,
d_last_dom int,
d_same_day_ly int comment '是否同一天',
d_same_day_lq int comment '是否同一天',
d_current_day string comment '是否为当前的天数',
d_current_week string comment '是否为当前的周数',
d_current_month string comment '是否为当前的月份',
d_current_quarter string comment '是否为当前的季度',
d_current_year string comment '是否为当前的年份'
)
DWM(封装业务规则)
CREATE TABLE demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df (
id bigint comment '主键',
ss_sold_date_sk bigint comment '销售日期',
ss_sold_time_sk bigint comment '销售时间',
ss_item_sk bigint comment '销售物品',
ss_customer_sk bigint comment '客户信息',
ss_cdemo_sk bigint,
ss_hdemo_sk bigint,
ss_addr_sk bigint comment '地址信息',
ss_store_sk bigint comment '商店信息',
ss_promo_sk bigint comment '促销信息',
ss_ticket_number bigint comment '订单号',
ss_quantity int comment '数量',
ss_wholesale_cost double comment '整个销售总额',
ss_list_price double comment '列表价格',
ss_sales_price double comment '销售价格',
ss_ext_discount_amt double comment '外部折扣相关信息',
ss_ext_sales_price double comment '外部销售信息',
ss_ext_wholesale_cost double comment '外部整个销售总额',
ss_ext_list_price double comment '外部列表价格',
ss_ext_tax double comment '外部税相关',
ss_coupon_amt double comment '打折券',
ss_net_paid double comment '消费额',
ss_net_paid_inc_tax double comment '消费税',
ss_net_profit double comment '盈利',
c_customer_sk bigint comment '客户信息',
c_customer_id bigint comment '客户id',
c_current_cdemo_sk bigint,
c_current_hdemo_sk bigint,
c_current_addr_sk bigint comment '地址信息',
c_first_shipto_date_sk bigint comment '运输时间',
c_first_sales_date_sk bigint comment '销售时间',
c_salutation string comment '称呼',
c_first_name string comment '名',
c_last_name string comment '姓',
c_preferred_cust_flag string comment '偏好标记',
c_birth_day int comment '生日',
c_birth_month int comment '生日月份',
c_birth_year int comment '出生年',
c_birth_country string comment '出生国家',
c_login string comment '登录信息',
c_email_address string comment '邮件地址',
c_last_review_date_sk bigint comment '上次评价日期'
)
APP(差异化数据服务,提供看板数据)
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_ticket_top (
store_sk bigint comment '商店信息',
wholesale_cost double comment '整个销售总额'
)
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3(
customer_sk bigint comment '客户信息',
salutation string comment '称呼',
first_name string comment '名',
last_name string comment '姓',
wholesale_cost double comment '整个销售总额'
)
CREATE TABLE demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution(
county string comment '国家',
distribution_count int comment '分布商店总数'
)
步骤5: 数仓任务构建
在任务开发中,新建“数据开发”任务:
ODS层数据任务
INSERT INTO demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df
SELECT row_number() OVER(
ORDER BY
RAND()
) AS id,
sales.ss_sold_date_sk,
sales.ss_sold_time_sk,
sales.ss_item_sk,
sales.ss_customer_sk,
sales.ss_cdemo_sk,
sales.ss_hdemo_sk,
sales.ss_addr_sk,
sales.ss_store_sk,
sales.ss_promo_sk,
sales.ss_ticket_number,
sales.ss_quantity,
sales.ss_wholesale_cost,
sales.ss_list_price,
sales.ss_sales_price,
sales.ss_ext_discount_amt,
sales.ss_ext_sales_price,
sales.ss_ext_wholesale_cost,
sales.ss_ext_list_price,
sales.ss_ext_tax,
sales.ss_coupon_amt,
sales.ss_net_paid,
sales.ss_net_paid_inc_tax,
sales.ss_net_profit,
cust.c_customer_sk,
cust.c_customer_id,
cust.c_current_cdemo_sk,
cust.c_current_hdemo_sk,
cust.c_current_addr_sk,
cust.c_first_shipto_date_sk,
cust.c_first_sales_date_sk,
cust.c_salutation,
cust.c_first_name,
cust.c_last_name,
cust.c_preferred_cust_flag,
cust.c_birth_day,
cust.c_birth_month,
cust.c_birth_year,
cust.c_birth_country,
cust.c_login,
cust.c_email_address,
cust.c_last_review_date_sk,
stores.s_store_sk,
stores.s_store_id,
stores.s_rec_start_date,
stores.s_rec_end_date,
stores.s_closed_date_sk,
stores.s_store_name,
stores.s_number_employees,
stores.s_floor_space,
stores.s_hours,
stores.s_manager,
stores.s_market_id,
stores.s_geography_class,
stores.s_market_desc,
stores.s_market_manager,
stores.s_division_id,
stores.s_division_name,
stores.s_company_id,
stores.s_company_name,
stores.s_street_number,
stores.s_street_name,
stores.s_street_type,
stores.s_suite_number,
stores.s_city,
stores.s_county,
stores.s_state,
stores.s_zip,
stores.s_gmt_offset,
stores.s_tax_precentage,
dim.d_date_sk,
dim.d_date_id,
dim.d_date,
dim.d_month_seq,
dim.d_week_seq,
dim.d_quarter_seq,
dim.d_year,
dim.d_dow,
dim.d_moy,
dim.d_dom,
dim.d_qoy,
dim.d_fy_year,
dim.d_fy_quarter_seq,
dim.d_fy_week_seq,
dim.d_day_name,
dim.d_quarter_name,
dim.d_holiday,
dim.d_weekend,
dim.d_following_holiday,
dim.d_first_dom,
dim.d_last_dom,
dim.d_same_day_ly,
dim.d_same_day_lq,
dim.d_current_day,
dim.d_current_week,
dim.d_current_month,
dim.d_current_quarter,
dim.d_current_year
FROM demo_tpc_ds_2022_11_07_59.store_sales sales
JOIN demo_tpc_ds_2022_11_07_59.customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
JOIN demo_tpc_ds_2022_11_07_59.store stores
ON sales.ss_store_sk = stores.s_store_sk
JOIN demo_tpc_ds_2022_11_07_59.date_dim dim
ON sales.ss_sold_date_sk = dim.d_date_sk
DWD层数据任务
INSERT INTO demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df
SELECT row_number() OVER(
ORDER BY
RAND()
) AS id,
ss_sold_date_sk,
ss_sold_time_sk,
ss_item_sk,
ss_customer_sk,
ss_cdemo_sk,
ss_hdemo_sk,
ss_addr_sk,
ss_store_sk,
ss_promo_sk,
ss_ticket_number,
ss_quantity,
ss_wholesale_cost,
ss_list_price,
ss_sales_price,
ss_ext_discount_amt,
ss_ext_sales_price,
ss_ext_wholesale_cost,
ss_ext_list_price,
ss_ext_tax,
ss_coupon_amt,
ss_net_paid,
ss_net_paid_inc_tax,
ss_net_profit,
c_customer_sk,
c_customer_id,
c_current_cdemo_sk,
c_current_hdemo_sk,
c_current_addr_sk,
c_first_shipto_date_sk,
c_first_sales_date_sk,
c_salutation,
c_first_name,
c_last_name,
c_preferred_cust_flag,
c_birth_day,
c_birth_month,
c_birth_year,
c_birth_country,
c_login,
c_email_address,
c_last_review_date_sk,
s_store_sk,
s_store_id,
s_rec_start_date,
s_rec_end_date,
s_closed_date_sk,
s_store_name,
s_number_employees,
s_floor_space,
s_hours,
s_manager,
s_market_id,
s_geography_class,
s_market_desc,
s_market_manager,
s_division_id,
s_division_name,
s_company_id,
s_company_name,
s_street_number,
s_street_name,
s_street_type,
s_suite_number,
s_city,
s_county,
s_state,
s_zip,
s_gmt_offset,
s_tax_precentage,
d_date_sk,
d_date_id,
d_date,
d_month_seq,
d_week_seq,
d_quarter_seq,
d_year,
d_dow,
d_moy,
d_dom,
d_qoy,
d_fy_year,
d_fy_quarter_seq,
d_fy_week_seq,
d_day_name,
d_quarter_name,
d_holiday,
d_weekend,
d_following_holiday,
d_first_dom,
d_last_dom,
d_same_day_ly,
d_same_day_lq,
d_current_day,
d_current_week,
d_current_month,
d_current_quarter,
d_current_year
FROM demo_tpc_ds_2022_11_07_59.ods_demo_customer_store_sales_df ods
WHERE ods.ss_sold_date_sk IS NOT NULL
AND ods.ss_customer_sk IS NOT NULL
DWM层数据任务
在需求一、二中,都有分析近三天数据的需求,因此在DWM层。 基于DWD层,筛选出近三天的数据用于构建DWM层,如果未来有基于近三天的需求(例如:查看最近3天地区购买排行榜),可以基于此分层进一步输出APP层数据报表。
INSERT INTO demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df
SELECT row_number() OVER(
ORDER BY
RAND()
) AS id,
ss_sold_date_sk,
ss_sold_time_sk,
ss_item_sk,
ss_customer_sk,
ss_cdemo_sk,
ss_hdemo_sk,
ss_addr_sk,
ss_store_sk,
ss_promo_sk,
ss_ticket_number,
ss_quantity,
ss_wholesale_cost,
ss_list_price,
ss_sales_price,
ss_ext_discount_amt,
ss_ext_sales_price,
ss_ext_wholesale_cost,
ss_ext_list_price,
ss_ext_tax,
ss_coupon_amt,
ss_net_paid,
ss_net_paid_inc_tax,
ss_net_profit,
c_customer_sk,
c_customer_id,
c_current_cdemo_sk,
c_current_hdemo_sk,
c_current_addr_sk,
c_first_shipto_date_sk,
c_first_sales_date_sk,
c_salutation,
c_first_name,
c_last_name,
c_preferred_cust_flag,
c_birth_day,
c_birth_month,
c_birth_year,
c_birth_country,
c_login,
c_email_address,
c_last_review_date_sk
FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd
WHERE d_date IN (
SELECT DISTINCT d_date
FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df
ORDER BY
d_date
LIMIT 3
)
APP层数据任务
需求一:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_ticket_top
SELECT
ss_store_sk as store_sk,
sum(ss_wholesale_cost) as wholesale_cost
FROM
demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm
WHERE
dwm.ss_net_profit >0
GROUP BY ss_store_sk
需求二:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_customer_top3_recent3
SELECT
ss_customer_sk as customer_sk,
c_salution as salution,
c_first_name as first_name,
c_last_name as last_name,
sum(ss_wholesale_cost) as wholesale_cost
FROM
demo_tpc_ds_2022_11_07_59.dwm_demo_customer_store_sales_df dwm
GROUP BY
ss_customer_sk,
c_salution,
c_first_name,
c_last_name
需求三:
INSERT INTO demo_tpc_ds_2022_11_07_59.app_demo_shop_distribution
SELECT s_county AS country,
count(s_store_id) AS distribution_count
FROM demo_tpc_ds_2022_11_07_59.dwd_demo_customer_store_sales_df dwd
GROUP BY
dwd.s_county
步骤6:样例结果检查
需求一: 需求二: 需求三: