Still Using Pandas? Try Polars! This One Detailed Guide (25,000+ Characters) Is All You Need

Author (contributed post): 光州鲸

Contact: DTB3132897406
            

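Preface

When I got to machine learning, processing large datasets with Pandas was slow. I asked around, and an experienced developer told me to try Polars. I did, and Polars really is fast (in my experience, more than 10 times faster than pandas). So I started learning it, found there were few tutorials online, and had to work through the official English Polars documentation (sob). It was hard going, but I stuck with it, and now that I have learned a bit I want to write down what I know, modest as it is, in the hope that it gives you some ideas.

What? You want to know how Polars differs from pandas? Just remember one word: "fast". For everything else, ask an AI (doge). Its feature set is not yet as complete as pandas', but it already handles everyday tasks well. Enough talk, let's begin!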

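Contents	Contents
1. Selecting columns	9. fold() for row-wise operations (like pandas apply(axis=1))
2. when().then().otherwise() conditional selection	10. List and Array
3. Data type conversion	11. Struct, another Polars structure
4. Working with string data	12. Join: combining tables side by side
5. Categorical data for better performance	13. concat: concatenating tables
6. Combined operations with LazyFrame	14. Pivot: rows to columns
7. Handling missing values	15. Unpivot: columns to rows
8. Window functions with over() (familiar if you know SQL)	16. Time series analysis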

I. Core concepts

Overall, Polars feels a lot like SQL, while its data structures resemble pandas: Polars also has DataFrame and Series.


        
          
import polars as pl  
  
s = pl.Series("ints", [1, 2, 3, 4, 5])  
print(s)  
  
from datetime import date  
  
df = pl.DataFrame(  
    {  
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],  
        "birthdate": [  
            date(1997, 1, 10),  
            date(1985, 2, 15),  
            date(1983, 3, 22),  
            date(1981, 4, 30),  
        ],  
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)  
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)  
    }  
)  
  
print(df)

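When operating on data, Polars works with expressions and contexts. An expression describes how to select and transform part of a table; a context is like a sack that holds expressions and runs them.

import polars as pl

# An expression: the "weight" column divided by the square of the "height" column.
# pl.col() selects a column.
expression = pl.col("weight") / (pl.col("height") ** 2)
print(expression)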

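The contexts are select, with_columns, filter and group_by. Their roles are described below, using the table created at the start of the article.

select picks columns and can transform them as it selects, roughly like pandas df[['a', 'b']], though more flexible. It does not filter rows: the pandas row filters df[df['a'] > 100] and df.query() correspond to filter, described below.

result = df.select(
    pl.col("name"),  # select the name column
    # keep only the year of birthdate; dt is the namespace for date/time methods (covered later);
    # alias names the resulting column, which is worth remembering
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)
# Transform while selecting
result = df.select(pl.col("weight") / 10)  # arithmetic can be applied to a column as it is selected
print(result)
# If you use select where you meant filter (introduced below), you get a column of booleans, not the matching rows
result = df.select(pl.col("weight") > 70)
print(result)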

with_columns adds new columns or overwrites existing ones

df1 = df.clone()  # copy the original table
df1 = df1.with_columns(pl.col("birthdate").dt.year().alias("year"))
print(df1)

# To overwrite an existing column, leave out the alias
df1 = df1.with_columns(pl.col("birthdate").dt.year())
print(df1)

filter keeps the rows that satisfy a condition

result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),  # rows whose birthdate falls within the given range
    # rows whose height is above 1.7; separate arguments are combined with AND,
    # so a row must satisfy both conditions to be kept
    pl.col("height") > 1.7,
)
print(result)

# What about an OR relationship? Combine the conditions with |:
result = df.filter(
    (pl.col("weight") > 57) | (pl.col("height") < 1.7)  # this one should be self-explanatory
)
print(result)

group_by performs grouped operations

result = df.group_by(
    # group by decade: integer-divide the year by 10, then multiply by 10;
    # after grouping, agg selects or processes the other columns
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(
    pl.col("name"),
    pl.col("weight").mean(),
)
print(result)

After grouping, every group is reduced to a single row. The weight column holds the per-group mean, which is what you would expect, but the name column has been gathered into a list-like value for each group. That is the Polars List type, which comes with its own API of methods; as usual, it is covered later.

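Polars also has a lazy API, a way of writing a whole chain of operations as a single query. The operations run together, which is fast, and the code reads cleanly, so I recommend writing this way. It requires a LazyFrame, which you get from DataFrame.lazy() or by reading a file with scan_csv() (no need to memorize this now; it comes up again later). When processing is done, collect() turns the LazyFrame back into a DataFrame. Here is a somewhat involved example; skim it, you do not need to understand it yet.

import re

import polars as pl

# schema_overrides: a {column name: dtype} dict for your own file (its contents are not shown here)
data = (
    pl.read_csv(r'hello.csv', try_parse_dates=True, null_values='null', schema_overrides=schema_overrides)
    .lazy()
    # replace values that look like a decimal such as 0.9 with '100:10'
    .with_columns(
        pl.col('Discount_rate').map_elements(
            lambda x: '100:10' if re.search(r'0\.\d+', x) else x, return_dtype=pl.String
        )
    )
    # split 'a:b' into a list of two integers
    .with_columns(pl.col('Discount_rate').str.split(':').list.eval(pl.element().cast(pl.Int16, strict=False)))
    .with_columns(
        pl.col('Discount_rate').list.get(index=0, null_on_oob=True).alias('up_to_price'),
        pl.col('Discount_rate').list.get(index=1, null_on_oob=True).alias('discount'),
    )
    .with_columns(
        pl.when(pl.col('Coupon_id') == 'fixed').then(1).otherwise(0).alias('is_fixed').cast(pl.Int8),
        # return None (not pl.Null, which is a data type) to produce a null
        pl.col('Coupon_id').map_elements(lambda x: None if x == 'fixed' else x, return_dtype=pl.String),
        # pl.when(pl.col('Coupon_id').is_null()).then(0).otherwise(1).alias('Coupon_id').cast(pl.Int8),
    )
    .drop('Date_received', 'Coupon_id', 'Discount_rate')
    .collect()
)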

This is the official example:

q = (
    pl.scan_csv("docs/assets/data/iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .group_by("species")
    .agg(pl.col("sepal_width").mean())
)

df = q.collect()

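II. Data types

Skim these for now; anything unclear is explained later.

[Image: overview table of Polars data types]

III. Common operations

That covers the basics. Next, let's walk through some data-processing operations to get familiar with the workflow and pick up some new methods along the way.

1. Selecting columns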


        
          
# Initial data
from datetime import date, datetime

import polars as pl

df = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.datetime_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
).with_row_index("index")
print(df)

Select all columns

out = df.select(pl.col("*"))

# equivalent to
out = df.select(pl.all())
print(out)

Exclude some columns

out = df.select(pl.col("*").exclude("logged_at", "index"))  # exclude is worth remembering
print(out)

With a regular expression

out = df.select(pl.col("^.*(as|sa).*$"))  # columns whose name contains "as" or "sa"; good to know, though I doubt you will need it often
print(out)

By data type

out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())  # n_unique counts the distinct values in each column; worth remembering
print(out)

With selectors (I have not found much use for these; just be aware they exist)

import polars.selectors as cs

out = df.select(cs.integer(), cs.string())
print(out)

out = df.select(cs.numeric() - cs.first())  # set difference: the numeric columns, except the first column of the table
print(out)

out = df.select(cs.by_name("index") | ~cs.numeric())  # the column named "index", plus every non-numeric column
print(out)

out = df.select(cs.contains("index"), cs.matches(".*_.*"))  # columns whose name contains "index", and columns whose name contains "_"
print(out)

2. when().then().otherwise() conditional selection

This one is used a lot, so remember it.

import numpy as np
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
df_conditional = df.select(
    pl.col("nrs"),
    pl.when(pl.col("nrs") > 2)
    .then(True)
    .otherwise(pl.lit(False))
    .alias("conditional"),
)
print(df_conditional)

pl.lit() wraps a Python value as a literal expression. For numbers and booleans, as here, it can be omitted without changing the result. For strings it matters: a bare string passed to then() or otherwise() is read as a column name, so a string constant needs pl.lit().

3. Data type conversion

Initial data

df = pl.DataFrame(
    {
        "integers": [1, 2, 3, 4, 5],
        "big_integers": [1, 10000002, 3, 10000004, 10000005],
        "floats": [4.0, 5.0, 6.0, 7.0, 8.0],
        "floats_with_decimal": [4.532, 5.5, 6.5, 7.5, 8.5],
    }
)

# cast is the usual tool
out = df.select(
    pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
    pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
    pl.col("floats_with_decimal")
    .cast(pl.Int32)  # the fractional part is dropped
    .alias("floats_with_decimal_as_integers"),
)
print(out)
# The data types were listed earlier; if anything is still unclear, an AI assistant can explain it

# To convert strings to dates, or dates to strings, use the two methods below
df = pl.DataFrame(
    {
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 5), eager=True),
        "string": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
    }
)
out = df.select(
    pl.col("date").dt.to_string("%Y-%m-%d"),
    pl.col("string").str.to_datetime("%Y-%m-%d"),
)
print(out)

4. Working with string data

This uses the str API. Get a feel for it here; the follow-up advanced article will cover it in detail.

df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rab$bit", None]})
print(df)
# Lengths
out = df.select(
    pl.col("animal").str.len_bytes().alias("byte_count"),
    pl.col("animal").str.len_chars().alias("letter_count"),
)
print(out)

# Test each value against a string condition
out = df.select(
    pl.col("animal").str.contains("cat|bit").alias("regex"),  # contains(): does the value match the pattern (a regex by default)?
    pl.col("animal").str.contains("rab$", literal=True).alias("literal"),  # literal=True matches the text "rab$" exactly as written
    pl.col("animal").str.starts_with("rab").alias("starts_with"),  # starts_with(): does the value start with the given string?
    pl.col("animal").str.ends_with("dog").alias("ends_with"),  # ends_with(): does the value end with the given string?
)
print(out)  # select returns booleans here; to keep only the matching rows, pass a condition to filter instead

# Extract a specific value
df = pl.DataFrame(
    {
        "a": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
out = df.select(
    pl.col("a").str.extract(r"candidate=(\w+)", group_index=1),  # extract() returns the first capture group: the word characters after "candidate="
)
print(out)
# Extract every match and collect the matches into a List
df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
out = df.select(
    pl.col("foo").str.extract_all(r"(\d+)").alias("extracted_nrs"),  # extract_all() collects every run of digits into a List
)
print(out)

Replacing values

df = pl.DataFrame({"id": [1, 2], "text": ["123abc", "abc456"]})
out = df.with_columns(
    pl.col("text").str.replace(r"abc", "ABC"),  # replaces the first match only
    pl.col("text").str.replace_all("a", "-", literal=True).alias("text_replace_all"),  # replaces every match
)
print(out)
# To some extent this can stand in for a when().then().otherwise() expression

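5. Categorical data for better performance

We often store category labels as strings, but then every value takes up the full space of its text. A categorical type instead encodes the distinct categories of a column once and stores a compact code per row; the codes are mapped back to the original values only when the table is displayed. This noticeably reduces memory use and improves performance.

Polars has two categorical types, Enum and Categorical. With Enum you must declare the categories up front, which means you already know which categories the column can hold; Categorical works them out automatically. See below:

Enum requires the categories to be declared in advance

enum_dtype = pl.Enum(["Polar", "Panda", "Brown"])
enum_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=enum_dtype)
# Categorical does not
cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
# Categorical is the usual choice
# Concatenating categorical data is awkward, because the program cannot tell whether the two pieces
# share the same categories (one side may lack some or have extra ones).
# I suggest casting categorical columns to strings before concatenating.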

6. Combined operations with LazyFrame

Before reading the data, declare the data type of each column

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"

schema_overrides = {
    "first_name": pl.Categorical,
    "gender": pl.Categorical,
    "type": pl.Categorical,
    "state": pl.Categorical,
    "party": pl.Categorical,
}

dataset = pl.read_csv(url, schema_overrides=schema_overrides).with_columns(
    pl.col("birthday").str.to_date(strict=False)
)
print(dataset)

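# Run the query through a LazyFrame (this table takes a while to load)
q = (
    dataset.lazy()  # turn the DataFrame into a LazyFrame
    .group_by("first_name")  # group by first_name
    .agg(
        pl.len(),  # number of rows in each group
        pl.col("gender"),
        pl.first("last_name"),  # same as pl.col("last_name").first(): the first value of that column in each group
    )
    .sort("len", descending=True)  # sort by the len column, in descending order
    .limit(5)  # keep only the first five rows
)

df = q.collect()  # finally, collect() turns the LazyFrame back into a DataFrame
print(df)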

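7. Handling missing values

Polars has a single missing-value marker, null (None on the Python side; pl.Null is the matching data type). NaN is an ordinary floating-point value and does not count as missing. (Tip: read_csv() and scan_csv() take a null_values parameter, a single value or a list, that says which values should be read as null. Set null_values='null' so that the text "null" in the file becomes a real null; otherwise those "null" entries are not recognized as missing.)

Counting nulls

df = pl.DataFrame(
    {
        "value": [1, None, 5, 2, 4, 5, None],
    },
)
print(df)

null_count_df = df.null_count()  # null_count() counts the nulls in each column
print(null_count_df)

is_null_series = df.select(
    pl.col("value").is_null(),  # flags whether each value is null; append .sum() to get the total
)
print(is_null_series)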

Filling nulls

df = pl.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1, None, 3],
    },
)
print(df)

# fill_null() fills nulls; its parameters are covered in the advanced article, and DataFrame has the same method.
# The pl.lit() can be dropped here. If filling appears to do nothing, the "null" in the column is
# probably the string "null" rather than a real null.
fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(pl.lit(2)),
)
print(fill_literal_df)

fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),  # interpolate() fills nulls by interpolation; its parameters are covered in the advanced article
)
print(fill_interpolation_df)

Handling NaN (note that NaN is not a null; it is a special floating-point value)

import numpy as np

nan_df = pl.DataFrame(
    {
        "value": [1.0, np.nan, float("nan"), 3.0],
    },
)
print(nan_df)

mean_nan_df = nan_df.with_columns(
    pl.col("value").fill_nan(None).alias("value"),  # fill_nan(None) replaces NaN with null
)
print(mean_nan_df)

8. Window functions with over() (familiar if you know SQL)


        
          
import polars as pl  
  
df = pl.read_csv(  
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"  
)  
print(df.head())

out = df.select(
    "Type 1",
    "Type 2",  # a column that is selected as-is can be given by name, without pl.col()
    # over() computes the aggregate within each group but, unlike group_by, keeps every row;
    # the grouping is applied before mean(), so the mean is taken per group
    pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
    pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"),  # partition by several columns
    pl.col("Attack").mean().alias("avg_attack"),
)
print(out)

9. fold() for row-wise operations (like pandas apply(axis=1))

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)
out = df.select(
    # acc is the starting value, function combines the accumulator with each column in turn,
    # and exprs gives the column expressions to fold over (plain columns or computed ones)
    pl.fold(acc=pl.lit(0), function=lambda acc, x: acc + x, exprs=pl.all()).alias("sum")
)
print(out)

Row-wise conditions

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [0, 1, 2],
    }
)

out = df.filter(
    pl.fold(
        acc=pl.lit(True),
        function=lambda acc, x: acc & x,
        exprs=pl.col("*") > 1,  # every column in the row must be greater than 1 for the result to be True; only then does filter keep the row
    )
)
print(out)

Row-wise string concatenation

df = pl.DataFrame(
    {
        "a": ["a", "b", "c"],
        "b": [1, 2, 3],
    }
)

out = df.select(pl.concat_str(["a", "b"]))  # column b starts out as an integer type, but concat_str() casts it to a string automatically
print(out)

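10. List and Array

We met the List type briefly in the group_by example. Polars provides a large set of methods for it, which leaves plenty of room to work, and on the whole List is a pleasant structure to use. Polars also has an Array data type, similar to a NumPy ndarray, in which every row has the same length. Enough talk; see below: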


        
          
weather = pl.DataFrame(  
    {  
        "station": ["Station " + str(x) for x in range(1, 6)],  
        "temperatures": [  
            "20 5 5 E1 7 13 19 9 6 20",  
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",  
            "19 24 E9 16 6 12 10 22",  
            "E2 E0 15 7 8 10 E1 24 17 13 6",  
            "14 8 E0 16 22 24 E1",  
        ],  
    }  
)  
print(weather)

out = weather.with_columns(pl.col("temperatures").str.split(" "))  # str.split() returns a List; no need to memorize it, the advanced article covers it
print(out)

out = weather.with_columns(pl.col("temperatures").str.split(" ")).explode("temperatures")  # explode() turns each List element into its own row
print(out)

out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),  # first three elements
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),  # last three elements
    pl.col("temperatures").list.len().alias("obs"),  # length of the List
)
print(out)

Iterating over a List

out = weather.with_columns(
    pl.col("temperatures")
    .str.split(" ")
    .list.eval(pl.element().str.contains("(?i)[a-z]"))  # list.eval() runs an expression over the elements; pl.element() stands for each element of the List, like a pronoun
    .list.sum()
    .alias("errors")
)
print(out)

11. Struct, another Polars structure

In this structure, each row's value is shaped like a dictionary. Enough talk; take a look:

# The code below creates a Series whose values are structs. Series has not had much attention so far,
# so one more word on it: you can think of a Series as a NumPy array, though, as with a pandas Series,
# you will rarely use one on its own.

import polars as pl
rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

Create the initial table

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
    }
)
print(ratings)

# value_counts returns a Struct column. Why? An expression produces a single column,
# so the value and its count are packed together into one struct.
# How do we split that struct into two columns?
out = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(out)

# Splitting the struct apart
out = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest("Theatre")  # unnest() splits it into columns; that is all you need to know about structs for now
print(out)

12. Join: combining tables side by side

Create two tables

df1 = pl.DataFrame({
    'property_name': ['Old Ken Road', 'Whitechapel Road', 'The Shire', 'Kings Cross Station', 'The Angel, Islington'],
    'group': ['brown', 'brown', 'fantasy', 'stations', 'light_blue']
})
print(df1)

df2 = pl.DataFrame({
    'property_name': ['Old Ken Road', 'Whitechapel Road', 'Sesame Street', 'Kings Cross Station', 'The Angel, Islington'],
    'cost': ['60', '60', '100', '200', '100']
})
print(df2)

Joining

sep = '-' * 100
df3 = df1.join(df2, on='property_name', how='full')  # full join: all rows from both tables
print('full join', df3, '\n', sep)
df3 = df1.join(df2, on='property_name', how='inner')  # inner join: only rows whose key appears in both tables
print('inner join', df3, '\n', sep)
df3 = df1.join(df2, on='property_name', how='left')  # left join: every row of df1
print('left join', df3, '\n', sep)
df3 = df1.join(df2, on='property_name', how='right')  # right join: every row of df2
print('right join', df3, '\n', sep)
df3 = df1.join(df2, on='property_name', how='semi')  # semi join: rows of df1 that have a match in df2
print('semi join', df3, '\n', sep)
df3 = df1.join(df2, on='property_name', how='anti')  # anti join: rows of df1 that have no match in df2
print('anti join', df3, '\n', sep)
df3 = df1.join(df2, how='cross')  # cross join (Cartesian product)
print('cross join', df3, '\n', sep)

Besides on and how, join() also has left_on and right_on parameters, for when the key columns have different names in the two tables. They are mutually exclusive with on.

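13. concat: concatenating tables

concat supports many ways of combining tables. Since this is the basics, I will only cover vertical and horizontal concatenation; knowing how to use them is enough for now, and the advanced article will go into detail.

Vertical concatenation

df_v1 = pl.DataFrame({"a": [1], "b": [3]})
df_v2 = pl.DataFrame({"a": [2], "b": [4]})

df_vertical_concat = pl.concat([df_v1, df_v2], how="vertical")  # stack the rows
print(df_vertical_concat)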

Horizontal concatenation

df_h1 = pl.DataFrame({"l1": [1, 2], "l2": [3, 4]})
df_h2 = pl.DataFrame({"r1": [5, 6], "r2": [7, 8], "r3": [9, 10]})

df_horizontal_concat = pl.concat([df_h1, df_h2], how="horizontal")  # place the columns side by side
print(df_horizontal_concat)

14. Pivot: rows to columns

Pivoting is not something you do often, but you do need to know it. Enough talk:

# Create the table
df = pl.DataFrame(
    {
        "foo": ["A", "A", "B", "B", "C", "A"],
        "N": [1, 2, 2, 4, 2, 6],
        "bar": ["k", "l", "m", "n", "o", "l"],
    }
)
print(df)
# Pivot
out = df.pivot("bar", index="foo", values="N", aggregate_function="first")
print(out)

Looking at the result: compared with the original table, the bar column has been turned into columns (each named after a unique value of bar), and the foo column has been de-duplicated into the index column. The values of N are placed in the cells at the matching positions. aggregate_function decides how to reduce a cell to one value when a group holds two or more values. For example, the original table has two rows with bar "l" and foo "A", so that group holds two values, but the matching cell of the pivoted table can hold only one. The available reductions include 'min', 'max', 'first', 'last', 'sum', 'mean', 'median' and 'len'.

15. Unpivot: columns to rows

import polars as pl

df = pl.DataFrame(
    {
        "A": ["a", "b", "a"],
        "B": [1, 3, 5],
        "C": [10, 11, 12],
        "D": [2, 4, 6],
    }
)
print(df)

# Unpivot
out = df.unpivot(["C", "D"], index=["A", "B"])
print(out)

Looking at the result, the names of columns C and D have become the values of the variable column, their values have become the values of the value column, and columns A and B serve as the index columns. That is all you need to know about unpivoting for now.

16. Time series analysis

Polars has the following temporal data types:

Name	Description
Date	A calendar date, e.g. 2014-07-08
Datetime	A date and time, e.g. 2014-07-08 07:00:00
Duration	A time difference, like Python's timedelta
Time	A time of day, stored in nanoseconds

Create the initial table

df = pl.DataFrame({
    'Date': ['1981-02-23', '1981-05-06', '1981-05-18', '1981-09-25', '1982-07-08'],
    'Close': [24.62, 27.38, 28.0, 14.25, 11.0]
})
df = df.with_columns(pl.col("Date").str.to_date("%Y-%m-%d"))  # str.to_date(), seen earlier, converts date strings to the Date type
print(df)

Polars also has a dt API for time series operations

df_with_year = df.with_columns(pl.col("Date").dt.year().alias("year"))  # extract the year
print(df_with_year)
# As usual, just get a feel for it; the advanced article covers it in detail

Filter by a single value

from datetime import datetime
filtered_df = df.filter(
    pl.col("Date") == datetime(1981, 5, 6),
)
print(filtered_df)

# Filter by a date range
filtered_range_df = df.filter(
    pl.col("Date").is_between(datetime(1981, 1, 1), datetime(1981, 6, 1)),  # is_between is worth remembering
)
print(filtered_range_df)

Time-window analysis with group_by_dynamic: this one matters, so remember it

from datetime import date
df = (
    pl.date_range(
        start=date(2021, 1, 1),
        end=date(2021, 12, 31),
        interval="1d",
        eager=True,
    )
    .alias("time")
    .to_frame()  # turns the Series produced by date_range() into a DataFrame
)
print(df)

out = df.group_by_dynamic("time", every="1mo", period="1mo", closed="left").agg(
    # cum_count() numbers the non-null values of the column from first to last; reverse() flips the order
    pl.col("time").cum_count().reverse().head(3).alias("day/eom"),
    # subtract the group's first value from each value to get durations, take the last one,
    # and convert it to a number of days
    ((pl.col("time") - pl.col("time").first()).last().dt.total_days() + 1).alias("days_in_month"),
)
print(out)

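Main parameters of group_by_dynamic, explained

[Image: table of the main group_by_dynamic parameters]

Closing words: that is all for the Polars basics. My main reference was the official English Polars documentation. I will write an advanced follow-up that covers features, methods and other finer details.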
