geopandas spatial join is extremely slow

13

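I am using the code below to find the country (and sometimes the state) for millions of GPS points. The code currently takes about one second per point, which is incredibly slow. The shapefile is 6 MB.

I read that geopandas uses rtree for spatial joins, which should make them very efficient, but that does not seem to be happening here. What am I doing wrong? I was hoping for around a thousand points per second.

The shapefile and CSV can be downloaded here (5 MB): https://www.dropbox.com/s/gdkxtpqupj0sidm/SpatialJoin.zip?dl=0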

import time

import pandas as pd
import geopandas as gpd
from geopandas.tools import sjoin
from shapely.geometry import Point

# Parameters
shapefile = "K:/.../Shapefiles/Used/World.shp"
df = pd.read_csv("K:/.../output2.csv", index_col=None, nrows=20)  # Limit to 20 rows for testing

if __name__ == "__main__":
    start = time.time()
    # Build one shapely Point per row from the longitude/latitude columns
    df['geometry'] = df.apply(lambda z: Point(z.Longitude, z.Latitude), axis=1)
    PointsGeodataframe = gpd.GeoDataFrame(df)
    PolygonsGeodataframe = gpd.GeoDataFrame.from_file(shapefile)
    # Treat the points as being in the same CRS as the polygons
    PointsGeodataframe.crs = PolygonsGeodataframe.crs
    print(time.time() - start)
    merged = sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left')
    print(time.time() - start)
    merged.to_csv("K:/01. Personal/04. Models/10. Location/output.csv", index=None)
    print(time.time() - start)

— Alexis Eggermont

Your data link returns a 404.

— Aaron

16

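Adding the argument op='within' to the sjoin function speeds up the point-in-polygon operation dramatically.

The default value is op='intersects', which I guess would also give the correct result, but it is 100 to 1000 times slower.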
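For reference, here is a minimal sketch of that call on made-up data. It assumes geopandas 0.10 or later, where the `op` keyword was renamed to `predicate` (on older versions, pass `op='within'` as written above). The two square "countries", the `point_id` column, and the three points are hypothetical, and `points_from_xy` is used here as a vectorized alternative to the row-wise `apply` in the question.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical stand-ins for the country polygons and the GPS points.
polygons = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[box(0, 0, 10, 10), box(20, 0, 30, 10)],
    crs="EPSG:4326",
)
df = pd.DataFrame({
    "point_id": [1, 2, 3],
    "Longitude": [5.0, 25.0, 50.0],
    "Latitude": [5.0, 5.0, 50.0],
})

# Build all point geometries in one vectorized call.
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]),
    crs=polygons.crs,
)

# Left join: every point is kept; points outside all polygons get NaN for "name".
merged = gpd.sjoin(points, polygons, how="left", predicate="within")
print(merged[["point_id", "name"]])
```

Whether `within` actually beats `intersects` depends on the geopandas version and on the shape of the data, so it is worth timing both on a sample.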

— Alexis Eggermont

To anyone reading this: it does not mean that within is somehow faster in general. Read nick_g's answer below.

— inc42

7

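The question asks how to take advantage of the r-tree in geopandas spatial joins, and another responder correctly points out that you should use "within" instead of "intersects". However, you can also take advantage of the r-tree spatial index in geopandas when using intersects/intersection, as demonstrated in this geopandas r-tree tutorial: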

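# Ask the r-tree for rows whose bounding boxes overlap the polygon's bounding box
spatial_index = gdf.sindex
possible_matches_index = list(spatial_index.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
# Run the exact geometric test on those candidates only
precise_matches = possible_matches[possible_matches.intersects(polygon)]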
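To make the two steps in that snippet concrete without any dependencies, here is a minimal pure-Python sketch of the same filter-then-refine idea. It is not how geopandas implements it: a linear bounding-box scan stands in for the r-tree lookup, and `bbox`, `point_in_bbox`, `point_in_polygon`, and the sample triangle and points are all made up for illustration.

```python
def bbox(ring):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_bbox(pt, box):
    """Cheap test: is the point inside the axis-aligned box?"""
    x, y = pt
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def point_in_polygon(pt, ring):
    """Ray-casting test; points exactly on the boundary get no special handling."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from pt cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# A right triangle whose bounding box is the square (0, 0)-(10, 10).
triangle = [(0, 0), (10, 0), (0, 10)]
points = [(1, 1), (9, 9), (20, 20), (4, 4)]

# Step 1 (cheap): keep only points that fall in the polygon's bounding box.
box = bbox(triangle)
possible_matches = [p for p in points if point_in_bbox(p, box)]

# Step 2 (exact): run the real geometric test on the candidates only.
precise_matches = [p for p in possible_matches if point_in_polygon(p, triangle)]

print(possible_matches)
print(precise_matches)
```

The point (9, 9) survives the bounding-box step but is rejected by the exact test, which is why the second step is still needed after the index lookup.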

— eos

5

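What is probably happening here is that only the right dataframe is fed into the rtree index: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/sjoin.py#L48-L55 For a run with op="intersects", this means the polygons are fed into the index, so for every point the corresponding polygon is found through the rtree index.

But for op="within", the geodataframes are flipped, since the operation is actually the inverse of contains: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/sjoin.py#L41-L43

So what happens when you switch op from op="intersects" to op="within" is that, for every polygon, the corresponding points are found through the rtree index, which in your case sped up the query.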
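The asymmetry described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the geopandas code: `CountingIndex` is a hypothetical stand-in for the r-tree that only counts lookups, and the two polygon bounding boxes and 1000 points are invented. It shows that both directions find the same pairs while the number of index queries differs by which side is indexed.

```python
class CountingIndex:
    """Toy stand-in for an r-tree: stores bounding boxes and counts lookups."""

    def __init__(self, boxes):
        self.boxes = list(boxes)
        self.queries = 0

    def intersection(self, query):
        """Return positions of stored boxes that overlap the query box."""
        self.queries += 1
        qminx, qminy, qmaxx, qmaxy = query
        return [
            i
            for i, (minx, miny, maxx, maxy) in enumerate(self.boxes)
            if minx <= qmaxx and maxx >= qminx and miny <= qmaxy and maxy >= qminy
        ]


# Hypothetical data: 2 polygon bounding boxes and 1000 points (zero-area boxes).
polygon_boxes = [(0, 0, 10, 10), (10, 0, 20, 10)]
points = [(i % 20 + 0.5, 5.0) for i in range(1000)]
point_boxes = [(x, y, x, y) for x, y in points]

# "intersects"-style: index the right frame (polygons), one query per point.
polygon_index = CountingIndex(polygon_boxes)
pairs_by_point = {
    (pt_i, poly_i)
    for pt_i, b in enumerate(point_boxes)
    for poly_i in polygon_index.intersection(b)
}

# "within"-style: frames swapped, so points are indexed, one query per polygon.
point_index = CountingIndex(point_boxes)
pairs_by_polygon = {
    (pt_i, poly_i)
    for poly_i, b in enumerate(polygon_boxes)
    for pt_i in point_index.intersection(b)
}

print(polygon_index.queries, point_index.queries)
```

Fewer queries is not automatically faster, since each per-polygon query returns many more candidates and the index over the points is larger to build; this only shows why the two predicates can behave so differently on the same data.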

— nick_g

1

You used non-permanent URLs; could you update them to point to a specific revision?

— inc42