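I am looking for a spatial clustering algorithm to use on point features in a PostGIS-enabled database. I am going to write a plpgsql function that takes the distance between points within the same cluster as input and returns an array of clusters as output. The most obvious solution is to build a buffer of the specified distance around a feature and search for features inside that buffer. If such features exist, continue building buffers around them, and so on. If no such features exist, the cluster is complete. Maybe there are some cleverer solutions?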
Answers:
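There are at least two good clustering methods for PostGIS: k-means (via the kmeans-postgresql extension) or clustering geometries within a threshold distance (PostGIS 2.2).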
kmeans-postgresql
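Installation: You need PostgreSQL 8.4 or later on a POSIX host system (I wouldn't know where to start for MS Windows). If you installed PostgreSQL from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS). Download and unzip: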
wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
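Before building, you need to set the USE_PGXS environment variable (my previous post said to delete this part of the Makefile, which was not the best option). One of these two commands should work for your Unix shell: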
# bash
export USE_PGXS=1
# csh
setenv USE_PGXS 1
Now build and install the extension:
make
make install
psql -f /usr/share/pgsql/contrib/kmeans.sql -U postgres -d postgis
(Note: I also tried this on Ubuntu 10.10, but had no luck, because the path reported by pg_config --pgxs does not exist! This is probably an Ubuntu packaging bug.)
Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo-random points in QGIS). Here is an example of what I did:
SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;
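The 5 I provided as the second argument of the kmeans window function is the K integer, which produces five clusters. You can change it to any integer you want.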
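Below are the 31 pseudo-random points I drew and the five centroids, with labels showing the count in each cluster. This was created using the SQL query above.
You can also try to illustrate where these clusters are with ST_MinimumBoundingCircle: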
SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;
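To make the role of K more concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Python. It is not the extension's code: the function name, the choice to seed the centroids with the first k points, and the sample coordinates are all assumptions made for illustration, and it assumes k is no larger than the number of points.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm on 2-D tuples.

    Returns (labels, centroids), with one cluster index per input point.
    """
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids

# Hypothetical sample data: two obvious groups.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(pts, 2)
print(labels, centroids)
```

Note that k-means always returns exactly K groups, however far apart the points are. That is the main difference from the distance-threshold approach.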
ST_ClusterWithin
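This aggregate function is included with PostGIS 2.2. It returns an array of GeometryCollections in which every component is within the given distance of at least one other component.
Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters: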
SELECT row_number() over () AS id,
ST_NumGeometries(gc),
gc AS geom_collection,
ST_Centroid(gc) AS centroid,
ST_MinimumBoundingCircle(gc) AS circle,
sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
SELECT unnest(ST_ClusterWithin(geom, 100)) gc
FROM rand_point
) f;
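The largest cluster, in the middle, has an enclosing circle radius of 65.3 units, or about 130 across, which is larger than the threshold. This is because the individual distances between neighboring member geometries are less than the threshold, so they chain together into one larger cluster.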
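The chaining effect can be reproduced outside the database. The sketch below is a brute-force Python illustration of clustering by threshold distance: it grows a cluster from a seed point by repeatedly pulling in any unassigned point within the threshold of a point already in the cluster. This is also the buffer-growing idea from the question. The function name and sample points are hypothetical, and it makes no claim about how PostGIS implements ST_ClusterWithin internally.

```python
import math
from collections import deque

def cluster_within(points, threshold):
    """Group points so each one is within `threshold` of at least one
    other member of its cluster. Returns lists of point indexes."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        queue = deque([seed])
        members = [seed]
        while queue:
            i = queue.popleft()
            # Brute-force neighbor search; a spatial index would replace this.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= threshold]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters

# Hypothetical points: a chain with 90-unit gaps plus one distant point.
chain = [(0, 0), (90, 0), (180, 0), (270, 0), (1000, 0)]
clusters = cluster_within(chain, 100)
print(clusters)
```

The first four points end up in a single cluster spanning 270 units even though the threshold is 100, because each consecutive pair is only 90 apart.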
I wrote a function that computes clusters of features based on the distance between them and builds a convex hull over each cluster:
CREATE OR REPLACE FUNCTION get_domains_n(lname varchar, geom varchar, gid varchar, radius numeric)
RETURNS SETOF record AS
$$
DECLARE
lid_new integer;
dmn_number integer := 1;
outr record;
innr record;
r record;
BEGIN
DROP TABLE IF EXISTS tmp;
EXECUTE 'CREATE TEMPORARY TABLE tmp AS SELECT '||gid||', '||geom||' FROM '||lname;
ALTER TABLE tmp ADD COLUMN dmn integer;
ALTER TABLE tmp ADD COLUMN chk boolean DEFAULT FALSE;
EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = (SELECT MIN('||gid||') FROM tmp)';
LOOP
LOOP
FOR outr IN EXECUTE 'SELECT '||gid||' AS gid, '||geom||' AS geom FROM tmp WHERE dmn = '||dmn_number||' AND NOT chk' LOOP
FOR innr IN EXECUTE 'SELECT '||gid||' AS gid, '||geom||' AS geom FROM tmp WHERE dmn IS NULL' LOOP
IF ST_DWithin(ST_Transform(ST_SetSRID(outr.geom, 4326), 3785), ST_Transform(ST_SetSRID(innr.geom, 4326), 3785), radius) THEN
--IF ST_DWithin(outr.geom, innr.geom, radius) THEN
EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = '||innr.gid;
END IF;
END LOOP;
EXECUTE 'UPDATE tmp SET chk = TRUE WHERE '||gid||' = '||outr.gid;
END LOOP;
SELECT INTO r dmn FROM tmp WHERE dmn = dmn_number AND NOT chk LIMIT 1;
EXIT WHEN NOT FOUND;
END LOOP;
SELECT INTO r dmn FROM tmp WHERE dmn IS NULL LIMIT 1;
IF FOUND THEN
dmn_number := dmn_number + 1;
EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = (SELECT MIN('||gid||') FROM tmp WHERE dmn IS NULL LIMIT 1)';
ELSE
EXIT;
END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT ST_ConvexHull(ST_Collect('||geom||')) FROM tmp GROUP by dmn';
RETURN;
END
$$
LANGUAGE plpgsql;
Example of using this function:
SELECT * FROM get_domains_n('poi', 'wkb_geometry', 'ogc_fid', 14000) AS g(gm geometry)
'poi' is the layer name, 'wkb_geometry' is the geometry column name, 'ogc_fid' is the table's primary key, and 14000 is the cluster distance.
Result of using this function: [figure not reproduced]
Build a geometry column in the table instead of storing lon/lat separately, and create a column with unique values (ID).
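So far, the most promising thing I have found is this extension for K-means clustering as a window function: http://pgxn.org/dist/kmeans/
However, I have not been able to install it successfully yet.
Otherwise, for basic grid clustering, you could use SnapToGrid.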
SELECT
array_agg(id) AS ids,
COUNT( position ) AS count,
ST_AsText( ST_Centroid(ST_Collect( position )) ) AS center
FROM mytable
GROUP BY
ST_SnapToGrid( ST_SetSRID(position, 4326), 22.25, 11.125)
ORDER BY
count DESC
;
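As a rough illustration of what grouping by a snapped grid does, here is a small Python sketch. It assumes snapping means rounding each coordinate to the nearest multiple of the cell size, and it reuses the cell sizes from the query above. The function name and sample points are hypothetical.

```python
from collections import defaultdict

def grid_clusters(points, size_x, size_y):
    """Bucket points by the grid node nearest to them.

    `points` maps an id to an (x, y) pair. Returns lists of ids,
    largest bucket first.
    """
    cells = defaultdict(list)
    for pid, (x, y) in points.items():
        # Index of the nearest grid node along each axis.
        key = (round(x / size_x), round(y / size_y))
        cells[key].append(pid)
    return sorted(cells.values(), key=len, reverse=True)

# Hypothetical lon/lat-like sample data.
sample = {1: (1.0, 1.0), 2: (3.0, 2.0), 3: (40.0, 20.0), 4: (-30.0, -9.0)}
buckets = grid_clusters(sample, 22.25, 11.125)
print(buckets)
```

Grid clustering is cheap, but two points that are very close together can still land in different buckets if a cell boundary runs between them.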
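Complementing @MikeT's answer...
For MS Windows:
Requirements:
What you will do:
Compile the source code with the cl.exe compiler to generate a DLL with the kmeans function.
Steps: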
Open kmeans.c in any editor:
After the #include lines, define the DLLEXPORT macro with:
#if defined(_WIN32)
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif
Put DLLEXPORT before each of the following lines:
PG_FUNCTION_INFO_V1(kmeans_with_init);
PG_FUNCTION_INFO_V1(kmeans);
extern Datum kmeans_with_init(PG_FUNCTION_ARGS);
extern Datum kmeans(PG_FUNCTION_ARGS);
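Open the Visual C++ command line.
In the command line:
Go to the extracted kmeans-postgresql directory.
Set POSTGRESPATH to your PostgreSQL installation, for example:
SET POSTGRESPATH=C:\Program Files\PostgreSQL\9.5
Run: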
cl.exe /I"%POSTGRESPATH%\include" /I"%POSTGRESPATH%\include\server" /I"%POSTGRESPATH%\include\server\port\win32" /I"%POSTGRESPATH%\include\server\port\win32_msvc" /I"C:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Include" /LD kmeans.c "%POSTGRESPATH%\lib\postgres.lib"
Copy kmeans.dll to %POSTGRESPATH%\lib
Now run these SQL commands in your database to "create" the functions.
CREATE FUNCTION kmeans(float[], int) RETURNS int
AS '$libdir/kmeans'
LANGUAGE c VOLATILE STRICT WINDOW;
CREATE FUNCTION kmeans(float[], int, float[]) RETURNS int
AS '$libdir/kmeans', 'kmeans_with_init'
LANGUAGE C IMMUTABLE STRICT WINDOW;
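Here is a way to display in QGIS the result of the PostGIS query given in part 2) of the answer above (the ST_ClusterWithin example).
Since QGIS handles neither geometry collections nor mixed geometry types in the same geometry column, I created two layers: one for the clusters and one for the clustered points.
First, for the clusters, you only need the polygons. The other results are lone points: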
SELECT id,countfeature,circle FROM (SELECT row_number() over () AS id,
ST_NumGeometries(gc) as countfeature,
ST_MinimumBoundingCircle(gc) AS circle
FROM (
SELECT unnest(ST_ClusterWithin(the_geom, 100)) gc
FROM rand_point
) f) a WHERE ST_GeometryType(circle) = 'ST_Polygon'
Then, for the clustered points, you need to convert each geometry collection to a multipoint:
SELECT row_number() over () AS id,
ST_NumGeometries(gc) as countfeature,
ST_CollectionExtract(gc,1) AS multipoint
FROM (
SELECT unnest(ST_ClusterWithin(the_geom, 100)) gc
FROM rand_point
) f
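Some points share the same coordinates, so the labels may be confusing.
You can use the K-means approach more easily with the ST_ClusterKMeans method, which is available in PostGIS from version 2.3. Example: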
SELECT kmean, count(*), ST_SetSRID(ST_Extent(geom), 4326) as bbox
FROM
(
SELECT ST_ClusterKMeans(geom, 20) OVER() AS kmean, ST_Centroid(geom) as geom
FROM sls_product
) tsub
GROUP BY kmean;
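In the example above, the bounding box of the features is used as the cluster geometry. The first image shows the original geometries, and the second image is the result of the select above.
A bottom-up clustering solution, taken from "Get a single cluster from cloud of points with max diameter in postgis", which involves no dynamic queries.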
CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))
and a type that also carries the cluster ID:
CREATE TYPE clustered_pt AS (
gid character varying(32),
the_geom geometry(Point),
cluster_id int)
Next, the algorithm function:
CREATE OR REPLACE FUNCTION buc(points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$
DECLARE
srid int;
joined_clusters int[];
BEGIN
--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
RETURN QUERY SELECT gid, the_geom, 1 FROM unnest(points);
RETURN;
END IF;
CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;
BEGIN
ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;
TRUNCATE points2;
--inserting points in
INSERT INTO points2(gid, the_geom)
(SELECT (unnest(points)).* );
--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;
UPDATE points2 --transforming points to a projected coordinate system in meters (EPSG:26986 is NAD83 / Massachusetts Mainland; choose one that suits your data) so distances will be calculated in meters.
SET the_geom = ST_TRANSFORM(the_geom,26986);
--Adding spatial index
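CREATE INDEX IF NOT EXISTS points_index --IF NOT EXISTS (PostgreSQL 9.5+) avoids an error when this function runs more than once in a transaction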
ON points2
USING gist
(the_geom);
ANALYZE points2;
LOOP
--If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
> 2 * radius
THEN
EXIT;
END IF;
joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))
LIMIT 1;
UPDATE points2
SET cluster_id = joined_clusters[1]
WHERE cluster_id = joined_clusters[2];
--If there's only 1 cluster left, exit loop
IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
EXIT;
END IF;
END LOOP;
RETURN QUERY SELECT gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql
Usage:
WITH subq AS(
SELECT ARRAY_AGG((gid, the_geom)::pt) AS points
FROM data
GROUP BY collection_id)
SELECT (clusters).* FROM
(SELECT buc(points, radius) AS clusters FROM subq
) y;
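For comparison with the threshold-distance approach, the merge rule used by the function above can be sketched in a few lines of Python. This is a simplified illustration rather than a port of the plpgsql code: it starts with every point in its own cluster, repeatedly merges the pair of clusters whose farthest members are closest together, and stops once that distance exceeds twice the radius. The function name and sample points are hypothetical.

```python
import math
from itertools import combinations

def bottom_up_clusters(points, radius):
    """Agglomerative clustering that merges on the maximum pairwise
    distance, so no cluster grows wider than 2 * radius."""
    clusters = [[i] for i in range(len(points))]

    def max_distance(a, b):
        # Largest distance between any member of a and any member of b.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Pair of clusters with the smallest maximum distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: max_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        if max_distance(clusters[a], clusters[b]) > 2 * radius:
            break  # any further merge would exceed the allowed diameter
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe: combinations() guarantees a < b
    return [sorted(c) for c in clusters]

# Hypothetical points on a line, with an allowed diameter of 3 units.
line = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
result = bottom_up_clusters(line, 1.5)
print(result)
```

Because the merge test uses the farthest pair of members, this variant cannot chain points into a cluster wider than the limit. The price is recomputing pairwise distances on every pass, which becomes expensive on large point sets.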