Using OGR and Shapely more efficiently? [closed]

29

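I'm looking for suggestions on how to make my Python code more efficient. Normally efficiency doesn't matter to me, but I'm now working with a text file of US locations containing over 1.5 million points. With the setup below, running the operations on a single point takes about 5 seconds; I need to get that figure down.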

I'm using three different Python GIS packages to perform several operations on the points and to write out a new delimited text file.

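1. I use OGR to read a county boundary shapefile and get access to the boundary geometries.
2. Shapely checks whether a point is within any of these counties.
3. If it is within one, I use the Python Shapefile Library to pull attribute information from the boundary .dbf.
4. I then write some information from both sources to a text file.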

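I suspect the inefficiency lies in having 2-3 tiers of loops, but I'm not sure what to do about it. I'm particularly looking for help from anyone experienced with any of these three packages, as this is my first time using them.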

import os, csv
from shapely.geometry import Point
from shapely.geometry import Polygon
from shapely.wkb import loads
from osgeo import ogr
import shapefile

pointFile = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\NationalFile_20110404.txt"
shapeFolder = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New"
#historicBounds = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\US_Counties_1860s_NAD"
historicBounds = "US_Counties_1860s_NAD"
writeFile = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\NewNational_Gazet.txt"

#opens the point file, reads it as a delimited file, skips the first line
openPoints = open(pointFile, "r")
reader = csv.reader(openPoints, delimiter="|")
reader.next()

#opens the write file
openWriteFile = open(writeFile, "w")

#uses Python Shapefile Library to read attributes from .dbf
sf = shapefile.Reader("C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\US_Counties_1860s_NAD.dbf")
records = sf.records()
print "Starting loop..."

#This will loop through the points in pointFile    
for row in reader:
    print row
    shpIndex = 0
    pointX = row[10]
    pointY = row[9]
    thePoint = Point(float(pointX), float(pointY))
    #This section uses OGR to read the geometry of the shapefile
    openShape = ogr.Open((str(historicBounds) + ".shp"))
    layers = openShape.GetLayerByName(historicBounds)
    #This section loops through the geometries, determines if the point is in a polygon
    for element in layers:
        geom = loads(element.GetGeometryRef().ExportToWkb())
        if geom.geom_type == "Polygon":
            if thePoint.within(geom) == True:
                print "!!!!!!!!!!!!! Found a Point Within Historic !!!!!!!!!!!!"
                print str(row[1]) + ", " + str(row[2]) + ", " + str(row[5]) + " County, " + str(row[3])
                print records[shpIndex]
                openWriteFile.write((str(row[0]) + "|" + str(row[1]) + "|" + str(row[2]) + "|" + str(row[5]) + "|" + str(row[3]) + "|" + str(row[9]) + "|" + str(row[10]) + "|" + str(records[shpIndex][3]) + "|" + str(records[shpIndex][9]) + "|\n"))
        if geom.geom_type == "MultiPolygon":
            for pol in geom:
                if thePoint.within(pol) == True:
                    print "!!!!!!!!!!!!!!!!! Found a Point Within MultiPolygon !!!!!!!!!!!!!!"
                    print str(row[1]) + ", " + str(row[2]) + ", " + str(row[5]) + " County, " + str(row[3])
                    print records[shpIndex]
                    openWriteFile.write((str(row[0]) + "|" + str(row[1]) + "|" + str(row[2]) + "|" + str(row[5]) + "|" + str(row[3]) + "|" + str(row[9]) + "|" + str(row[10]) + "|" + str(records[shpIndex][3]) + "|" + str(records[shpIndex][9]) + "|\n"))
        shpIndex = shpIndex + 1
    print "finished checking point"
    openShape = None
    layers = None


#close the file objects (pointFile and writeFile are only path strings)
openPoints.close()
openWriteFile.close()
print "Done"

— Grant

3

You might consider posting this on Code Review: codereview.stackexchange.com

— RyanDalton, 2011

21

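The first step is to move the shapefile open outside the row loop; at the moment you are opening and closing the shapefile 1.5 million times.

To be honest, though, I'd stuff the whole lot into PostGIS and do it with SQL on indexed tables.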

— Ian Turton
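
The effect of moving the load out of the row loop can be sketched without any GIS library. In this minimal sketch, `load_counties` is a hypothetical stand-in for the `ogr.Open` plus geometry-conversion step in the question, the counties are made-up rectangles, and `load_calls` only counts how often the expensive step runs.

```python
# Counter showing how often the (pretend) expensive load step runs.
load_calls = 0


def load_counties():
    """Hypothetical stand-in for ogr.Open(...) plus geometry conversion."""
    global load_calls
    load_calls += 1
    # name -> (minx, miny, maxx, maxy); made-up illustrative rectangles
    return {"Alpha": (0, 0, 10, 10), "Beta": (10, 0, 20, 10)}


def find_county(x, y, counties):
    """Return the name of the first rectangle containing (x, y), or None."""
    for name, (minx, miny, maxx, maxy) in counties.items():
        if minx <= x < maxx and miny <= y < maxy:
            return name
    return None


points = [(1.0, 1.0), (15.0, 5.0), (30.0, 30.0)]

# Pattern from the question: the boundaries are reloaded for every point.
slow = [find_county(x, y, load_counties()) for x, y in points]
slow_loads = load_calls

# Hoisted pattern: load once, before the point loop.
load_calls = 0
counties = load_counties()
fast = [find_county(x, y, counties) for x, y in points]
fast_loads = load_calls
```

Both versions produce the same matches; only the number of loads differs (one per point versus one in total), which is where the 1.5 million repeated opens in the question come from.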

19

A quick look at your code brings a few optimisations to mind:

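1. First check each point against the bounding box/envelope of each polygon to eliminate the obvious outliers. You could go a step further and count how many bboxes a point falls in: if it is exactly one, it doesn't need to be tested against the more complex geometry (well, actually, if it falls in more than one, it will need further testing. You could do two passes to separate the simple cases from the complex ones).
2. Instead of looping over each point and testing it against the polygons, loop over the polygons and test each point. Loading/converting the geometry is slow, so you want to do it as little as possible. Likewise, build a list of points from the CSV up front, again to avoid redoing that work for every point and then discarding the result at the end of the iteration.
3. Spatially index your points. This involves converting them to a shapefile, a SpatiaLite file, or something like a PostGIS/PostgreSQL database. The advantage is that tools like OGR will then be able to do most of the work for you.
4. Don't write the output until the end: print() is an expensive function at the best of times. Instead, store the data in a list and write it out at the very end using the standard Python pickling or list-dumping functions.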

— MerseyViking
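
The bounding-box pre-filter, the polygon-outer loop and the deferred output from this answer can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`bbox`, `in_bbox`, `in_ring`, `match_points`) and the sample data are made up, and a simple ray-casting test stands in for Shapely's `within` so that the example has no dependencies.

```python
def bbox(ring):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(x, y, box):
    """Cheap rejection test: is (x, y) inside the bounding box?"""
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def in_ring(x, y, ring):
    """Ray-casting (even-odd) point-in-polygon test for a simple ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_points(points, polygons):
    """points: list of (id, x, y); polygons: dict of name -> ring."""
    lines = []
    for name, ring in polygons.items():   # polygons in the outer loop
        box = bbox(ring)                   # computed once per polygon
        for pid, x, y in points:
            # The exact test only runs for points that pass the bbox test.
            if in_bbox(x, y, box) and in_ring(x, y, ring):
                lines.append("%s|%s" % (pid, name))
    return lines


# Made-up sample data: point 2 is inside the triangle's bbox but not the triangle.
polygons = {
    "triangle": [(0, 0), (10, 0), (0, 10)],
    "square": [(20, 0), (30, 0), (30, 10), (20, 10)],
}
points = [(1, 2.0, 2.0), (2, 8.0, 8.0), (3, 25.0, 5.0), (4, 50.0, 50.0)]

lines = match_points(points, polygons)
output = "\n".join(lines)   # build the text once; write it to disk in one call
```

With real data the rings would come from the shapefile and the points from the CSV, both read once before the loops start.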

5

The first two will pay off handsomely. You could also speed things up a bit by using ogr for everything instead of Shapely and Shapefile.

— sgillies

2

For anything related to "Python" and "spatial index", look no further than Rtree, as it can very quickly find shapes that are near other shapes.

— Mike T
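
To illustrate what a spatial index buys you, here is a minimal sketch that uses a uniform grid instead of an R-tree. It is a deliberately simpler stand-in, not the Rtree package's API: the `GridIndex` class, its methods and the sample boxes are all hypothetical. The idea is the same, though: a point query returns only a handful of candidate bounding boxes rather than scanning every polygon.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid over bounding boxes; a simple stand-in for an R-tree."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # (col, row) -> ids of overlapping boxes
        self.boxes = {}                  # id -> (minx, miny, maxx, maxy)

    def _cell(self, x, y):
        return (int(math.floor(x / self.cell_size)),
                int(math.floor(y / self.cell_size)))

    def insert(self, item_id, box):
        """Register a bounding box in every grid cell it overlaps."""
        minx, miny, maxx, maxy = box
        self.boxes[item_id] = box
        col0, row0 = self._cell(minx, miny)
        col1, row1 = self._cell(maxx, maxy)
        for col in range(col0, col1 + 1):
            for row in range(row0, row1 + 1):
                self.cells[(col, row)].append(item_id)

    def candidates(self, x, y):
        """Ids whose bounding box contains (x, y); only one cell is inspected."""
        found = []
        for item_id in self.cells.get(self._cell(x, y), []):
            minx, miny, maxx, maxy = self.boxes[item_id]
            if minx <= x <= maxx and miny <= y <= maxy:
                found.append(item_id)
        return found


# Made-up bounding boxes for three polygons.
index = GridIndex(cell_size=10)
index.insert("A", (0, 0, 10, 10))
index.insert("B", (5, 5, 15, 15))
index.insert("C", (100, 100, 110, 110))

both = index.candidates(7, 7)        # falls in the boxes of A and B
only_a = index.candidates(2, 2)      # falls in the box of A only
nothing = index.candidates(50, 50)   # no box nearby
```

Each candidate would still need an exact point-in-polygon test. With the Rtree package, the corresponding steps are `index.Index()`, `insert()` and `intersection()`.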