优化大型数据库查询(25+百万行,使用max()和GROUP BY)


14

我使用的是Postgres 9.3.5,并且数据库中有一个大表,目前它有2500万行,而且它往往会迅速增大。我正在尝试通过一个简单的查询来选择特定的行(所有行都unit_id只有最新的行unit_timestamp),例如:

SELECT unit_id, max(unit_timestamp) AS latest_timestamp FROM all_units GROUP BY unit_id;

没有任何索引,此查询大约需要35秒才能执行。使用定义的索引(CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);),查询时间缩短到(仅)19秒左右。

我想知道是否有可能在更短的时间内(例如几秒钟)执行我的查询,如果可以,我应该采取什么步骤来进一步优化它?

我的表结构转储如下所示:

CREATE TABLE "all_units" (
"unit_id" int4 NOT NULL,
"unit_timestamp" timestamp(6) NOT NULL,
"lon" float4,
"lat" float4,
"speed" float4,
"status" varchar(255) COLLATE "default"
)
ALTER TABLE "all_units" ADD PRIMARY KEY ("unit_id", "unit_timestamp");

EXPLAIN (ANALYZE, BUFFERS)如下:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=663998.38..664069.73 rows=7135 width=12) (actual time=84715.050..84732.021 rows=11094 loops=1)
  Buffers: shared hit=192 read=286819
  ->  Seq Scan on ais_sorted  (cost=0.00..538335.92 rows=25132492 width=12) (actual time=0.608..41264.196 rows=25132492 loops=1)
        Buffers: shared hit=192 read=286819
Total runtime: 84746.501 ms

我在服务器上的psql设置如下所示:

                name                 |  context   |  min_val  |   max_val    |                boot_val
-------------------------------------+------------+-----------+--------------+-----------------------------------------
 allow_system_table_mods             | postmaster |           |              | off
 application_name                    | user       |           |              |
 archive_command                     | sighup     |           |              |
 archive_mode                        | postmaster |           |              | off
 archive_timeout                     | sighup     | 0         | 1073741823   | 0
 array_nulls                         | user       |           |              | on
 authentication_timeout              | sighup     | 1         | 600          | 60
 autovacuum                          | sighup     |           |              | on
 autovacuum_analyze_scale_factor     | sighup     | 0         | 100          | 0.1
 autovacuum_analyze_threshold        | sighup     | 0         | 2147483647   | 50
 autovacuum_freeze_max_age           | postmaster | 100000000 | 2000000000   | 200000000
 autovacuum_max_workers              | postmaster | 1         | 8388607      | 3
 autovacuum_multixact_freeze_max_age | postmaster | 10000000  | 2000000000   | 400000000
 autovacuum_naptime                  | sighup     | 1         | 2147483      | 60
 autovacuum_vacuum_cost_delay        | sighup     | -1        | 100          | 20
 autovacuum_vacuum_cost_limit        | sighup     | -1        | 10000        | -1
 autovacuum_vacuum_scale_factor      | sighup     | 0         | 100          | 0.2
 autovacuum_vacuum_threshold         | sighup     | 0         | 2147483647   | 50
 backslash_quote                     | user       |           |              | safe_encoding
 bgwriter_delay                      | sighup     | 10        | 10000        | 200
 bgwriter_lru_maxpages               | sighup     | 0         | 1000         | 100
 bgwriter_lru_multiplier             | sighup     | 0         | 10           | 2
 block_size                          | internal   | 8192      | 8192         | 8192
 bonjour                             | postmaster |           |              | off
 bonjour_name                        | postmaster |           |              |
 bytea_output                        | user       |           |              | hex
 check_function_bodies               | user       |           |              | on
 checkpoint_completion_target        | sighup     | 0         | 1            | 0.5
 checkpoint_segments                 | sighup     | 1         | 2147483647   | 3
 checkpoint_timeout                  | sighup     | 30        | 3600         | 300
 checkpoint_warning                  | sighup     | 0         | 2147483647   | 30
 client_encoding                     | user       |           |              | SQL_ASCII
 client_min_messages                 | user       |           |              | notice
 commit_delay                        | superuser  | 0         | 100000       | 0
 commit_siblings                     | user       | 0         | 1000         | 5
 config_file                         | postmaster |           |              |
 constraint_exclusion                | user       |           |              | partition
 cpu_index_tuple_cost                | user       | 0         | 1.79769e+308 | 0.005
 cpu_operator_cost                   | user       | 0         | 1.79769e+308 | 0.0025
 cpu_tuple_cost                      | user       | 0         | 1.79769e+308 | 0.01
 cursor_tuple_fraction               | user       | 0         | 1            | 0.1
 data_checksums                      | internal   |           |              | off
 data_directory                      | postmaster |           |              |
 DateStyle                           | user       |           |              | ISO, MDY
 db_user_namespace                   | sighup     |           |              | off
 deadlock_timeout                    | superuser  | 1         | 2147483647   | 1000
 debug_assertions                    | user       |           |              | off
 debug_pretty_print                  | user       |           |              | on
 debug_print_parse                   | user       |           |              | off
 debug_print_plan                    | user       |           |              | off
 debug_print_rewritten               | user       |           |              | off
 default_statistics_target           | user       | 1         | 10000        | 100
 default_tablespace                  | user       |           |              |
 default_text_search_config          | user       |           |              | pg_catalog.simple
 default_transaction_deferrable      | user       |           |              | off
 default_transaction_isolation       | user       |           |              | read committed
 default_transaction_read_only       | user       |           |              | off
 default_with_oids                   | user       |           |              | off
 dynamic_library_path                | superuser  |           |              | $libdir
 effective_cache_size                | user       | 1         | 2147483647   | 16384
 effective_io_concurrency            | user       | 0         | 1000         | 1
 enable_bitmapscan                   | user       |           |              | on
 enable_hashagg                      | user       |           |              | on
 enable_hashjoin                     | user       |           |              | on
 enable_indexonlyscan                | user       |           |              | on
 enable_indexscan                    | user       |           |              | on
 enable_material                     | user       |           |              | on
 enable_mergejoin                    | user       |           |              | on
 enable_nestloop                     | user       |           |              | on
 enable_seqscan                      | user       |           |              | on
 enable_sort                         | user       |           |              | on
 enable_tidscan                      | user       |           |              | on
 escape_string_warning               | user       |           |              | on
 event_source                        | postmaster |           |              | PostgreSQL
 exit_on_error                       | user       |           |              | off
 external_pid_file                   | postmaster |           |              |
 extra_float_digits                  | user       | -15       | 3            | 0
 from_collapse_limit                 | user       | 1         | 2147483647   | 8
 fsync                               | sighup     |           |              | on
 full_page_writes                    | sighup     |           |              | on
 geqo                                | user       |           |              | on
 geqo_effort                         | user       | 1         | 10           | 5
 geqo_generations                    | user       | 0         | 2147483647   | 0
 geqo_pool_size                      | user       | 0         | 2147483647   | 0
 geqo_seed                           | user       | 0         | 1            | 0
 geqo_selection_bias                 | user       | 1.5       | 2            | 2
 geqo_threshold                      | user       | 2         | 2147483647   | 12
 gin_fuzzy_search_limit              | user       | 0         | 2147483647   | 0
 hba_file                            | postmaster |           |              |
 hot_standby                         | postmaster |           |              | off
 hot_standby_feedback                | sighup     |           |              | off
 ident_file                          | postmaster |           |              |
 ignore_checksum_failure             | superuser  |           |              | off
 ignore_system_indexes               | backend    |           |              | off
 integer_datetimes                   | internal   |           |              | on
 IntervalStyle                       | user       |           |              | postgres
 join_collapse_limit                 | user       | 1         | 2147483647   | 8
 krb_caseins_users                   | sighup     |           |              | off
 krb_server_keyfile                  | sighup     |           |              | FILE:/etc/postgresql-common/krb5.keytab
 krb_srvname                         | sighup     |           |              | postgres
 lc_collate                          | internal   |           |              | C
 lc_ctype                            | internal   |           |              | C
 lc_messages                         | superuser  |           |              |
 lc_monetary                         | user       |           |              | C
 lc_numeric                          | user       |           |              | C
 lc_time                             | user       |           |              | C
 listen_addresses                    | postmaster |           |              | localhost
 lo_compat_privileges                | superuser  |           |              | off
 local_preload_libraries             | backend    |           |              |
 lock_timeout                        | user       | 0         | 2147483647   | 0
 log_autovacuum_min_duration         | sighup     | -1        | 2147483647   | -1
 log_checkpoints                     | sighup     |           |              | off
 log_connections                     | backend    |           |              | off
 log_destination                     | sighup     |           |              | stderr
 log_directory                       | sighup     |           |              | pg_log
 log_disconnections                  | backend    |           |              | off
 log_duration                        | superuser  |           |              | off
 log_error_verbosity                 | superuser  |           |              | default
 log_executor_stats                  | superuser  |           |              | off
 log_file_mode                       | sighup     | 0         | 511          | 384
 log_filename                        | sighup     |           |              | postgresql-%Y-%m-%d_%H%M%S.log
 log_hostname                        | sighup     |           |              | off
 log_line_prefix                     | sighup     |           |              |
 log_lock_waits                      | superuser  |           |              | off
 log_min_duration_statement          | superuser  | -1        | 2147483647   | -1
 log_min_error_statement             | superuser  |           |              | error
 log_min_messages                    | superuser  |           |              | warning
 log_parser_stats                    | superuser  |           |              | off
 log_planner_stats                   | superuser  |           |              | off
 log_rotation_age                    | sighup     | 0         | 35791394     | 1440
 log_rotation_size                   | sighup     | 0         | 2097151      | 10240
 log_statement                       | superuser  |           |              | none
 log_statement_stats                 | superuser  |           |              | off
 log_temp_files                      | superuser  | -1        | 2147483647   | -1
 log_timezone                        | sighup     |           |              | GMT
 log_truncate_on_rotation            | sighup     |           |              | off
 logging_collector                   | postmaster |           |              | off
 maintenance_work_mem                | user       | 1024      | 2147483647   | 16384
 max_connections                     | postmaster | 1         | 8388607      | 100
 max_files_per_process               | postmaster | 25        | 2147483647   | 1000
 max_function_args                   | internal   | 100       | 100          | 100
 max_identifier_length               | internal   | 63        | 63           | 63
 max_index_keys                      | internal   | 32        | 32           | 32
 max_locks_per_transaction           | postmaster | 10        | 2147483647   | 64
 max_pred_locks_per_transaction      | postmaster | 10        | 2147483647   | 64
 max_prepared_transactions           | postmaster | 0         | 8388607      | 0
 max_stack_depth                     | superuser  | 100       | 2147483647   | 100
 max_standby_archive_delay           | sighup     | -1        | 2147483647   | 30000
 max_standby_streaming_delay         | sighup     | -1        | 2147483647   | 30000
 max_wal_senders                     | postmaster | 0         | 8388607      | 0
 password_encryption                 | user       |           |              | on
 port                                | postmaster | 1         | 65535        | 5432
 post_auth_delay                     | backend    | 0         | 2147         | 0
 pre_auth_delay                      | sighup     | 0         | 60           | 0
 quote_all_identifiers               | user       |           |              | off
 random_page_cost                    | user       | 0         | 1.79769e+308 | 4
 restart_after_crash                 | sighup     |           |              | on
 search_path                         | user       |           |              | "$user",public
 segment_size                        | internal   | 131072    | 131072       | 131072
 seq_page_cost                       | user       | 0         | 1.79769e+308 | 1
 server_encoding                     | internal   |           |              | SQL_ASCII
 server_version                      | internal   |           |              | 9.3.5
 server_version_num                  | internal   | 90305     | 90305        | 90305
 session_replication_role            | superuser  |           |              | origin
 shared_buffers                      | postmaster | 16        | 1073741823   | 1024
 shared_preload_libraries            | postmaster |           |              |
 sql_inheritance                     | user       |           |              | on
 ssl                                 | postmaster |           |              | off
 ssl_ca_file                         | postmaster |           |              |
 ssl_cert_file                       | postmaster |           |              | server.crt
 ssl_ciphers                         | postmaster |           |              | DEFAULT:!LOW:!EXP:!MD5:@STRENGTH
 ssl_crl_file                        | postmaster |           |              |
 ssl_key_file                        | postmaster |           |              | server.key
 ssl_renegotiation_limit             | user       | 0         | 2147483647   | 524288
 standard_conforming_strings         | user       |           |              | on
 statement_timeout                   | user       | 0         | 2147483647   | 0
 stats_temp_directory                | sighup     |           |              | pg_stat_tmp
 superuser_reserved_connections      | postmaster | 0         | 8388607      | 3
 synchronize_seqscans                | user       |           |              | on
 synchronous_commit                  | user       |           |              | on
 synchronous_standby_names           | sighup     |           |              |
 syslog_facility                     | sighup     |           |              | local0
 syslog_ident                        | sighup     |           |              | postgres
 tcp_keepalives_count                | user       | 0         | 2147483647   | 0
 tcp_keepalives_idle                 | user       | 0         | 2147483647   | 0
 tcp_keepalives_interval             | user       | 0         | 2147483647   | 0
 temp_buffers                        | user       | 100       | 1073741823   | 1024
 temp_file_limit                     | superuser  | -1        | 2147483647   | -1
 temp_tablespaces                    | user       |           |              |
 TimeZone                            | user       |           |              | GMT
 timezone_abbreviations              | user       |           |              |
 trace_notify                        | user       |           |              | off
 trace_recovery_messages             | sighup     |           |              | log
 trace_sort                          | user       |           |              | off
 track_activities                    | superuser  |           |              | on
 track_activity_query_size           | postmaster | 100       | 102400       | 1024
 track_counts                        | superuser  |           |              | on
 track_functions                     | superuser  |           |              | none
 track_io_timing                     | superuser  |           |              | off
 transaction_deferrable              | user       |           |              | off
 transaction_isolation               | user       |           |              | default
 transaction_read_only               | user       |           |              | off
 transform_null_equals               | user       |           |              | off
 unix_socket_directories             | postmaster |           |              | /var/run/postgresql
 unix_socket_group                   | postmaster |           |              |
 unix_socket_permissions             | postmaster | 0         | 511          | 511
 update_process_title                | superuser  |           |              | on
 vacuum_cost_delay                   | user       | 0         | 100          | 0
 vacuum_cost_limit                   | user       | 1         | 10000        | 200
 vacuum_cost_page_dirty              | user       | 0         | 10000        | 20
 vacuum_cost_page_hit                | user       | 0         | 10000        | 1
 vacuum_cost_page_miss               | user       | 0         | 10000        | 10
 vacuum_defer_cleanup_age            | sighup     | 0         | 1000000      | 0
 vacuum_freeze_min_age               | user       | 0         | 1000000000   | 50000000
 vacuum_freeze_table_age             | user       | 0         | 2000000000   | 150000000
 vacuum_multixact_freeze_min_age     | user       | 0         | 1000000000   | 5000000
 vacuum_multixact_freeze_table_age   | user       | 0         | 2000000000   | 150000000
 wal_block_size                      | internal   | 8192      | 8192         | 8192
 wal_buffers                         | postmaster | -1        | 2147483647   | -1
 wal_keep_segments                   | sighup     | 0         | 2147483647   | 0
 wal_level                           | postmaster |           |              | minimal
 wal_receiver_status_interval        | sighup     | 0         | 2147483      | 10
 wal_receiver_timeout                | sighup     | 0         | 2147483647   | 60000
 wal_segment_size                    | internal   | 2048      | 2048         | 2048
 wal_sender_timeout                  | sighup     | 0         | 2147483647   | 60000
 wal_sync_method                     | sighup     |           |              | fdatasync
 wal_writer_delay                    | sighup     | 1         | 10000        | 200
 work_mem                            | user       | 64        | 2147483647   | 1024
 xmlbinary                           | user       |           |              | base64
 xmloption                           | user       |           |              | content
 zero_damaged_pages                  | superuser  |           |              | off

您真的需要ALL unit_ids吗?因为WHERE子句会有所帮助。
Mihai 2014年

不幸的是,我确实需要所有这些,并且使情况变得更糟,现在我得到的回报仅为“ 11000+”,但将来,我相信它们的数量会增加5-10倍。
2014年

尝试SELECT DISTINCT ON (unit_id),unit_timestamp FROM t ORDER BY unit_timestamp DESC在unit_timestamp上使用单独的索引会有所帮助。
Mihai 2014年

物化视图是一种选择吗?
user1363989 2014年

嗯,根本不了解物化视图。我将尝试创建一个,看看是否有用!
2014年

Answers:


13

询问

您的查询被迫扫描整个表(或整个索引)。每行可以是另一个不同的单元。显着缩短该过程的唯一方法是使用一个单独的表,其中包含所有可用的单位 -只要有的单位大大少于中的条目,这将有所帮助all_units
由于您有2500万条目的〜11k单位(在注释中添加了),因此这肯定会有所帮助。

根据值的频率,可以使用两种查询技术来更快地获得结果:

  • 递归CTE
  • JOIN LATERAL
  • 相关子查询

关于SO的此相关答案中的详细信息:

只需要使用on上主键的隐式索引(unit_id, unit_timestamp),此查询就可以使用隐式来解决问题JOIN LATERAL

SELECT u.unit_id, a.max_ts
FROM unit u
  , (SELECT unit_timestamp AS max_ts
     FROM   all_units
     WHERE  unit_id = u.unit_id
     ORDER  BY unit_timestamp DESC
     LIMIT  1
     ) a;

排除没有输入的单位all_units,例如您的原始查询。
或相关性较低的子查询(可能更快):

SELECT u.unit_id
    , (SELECT unit_timestamp
       FROM   all_units
       WHERE  unit_id = u.unit_id
       ORDER  BY unit_timestamp DESC
       LIMIT  1) AS max_ts
FROM unit u;

包括未在中输入的单位all_units

效率取决于每个单元的条目。条目越多,这些查询之一的潜力就越大。

在具有相似表(500个“单位”,大表中有1M行)的快速本地测试中,具有相关子查询的查询比原始查询500倍。大表的PK索引上的仅索引扫描与原始查询中的顺序扫描相比。

由于您使用table tends to get even larger rapidly,因此物化视图可能不是一个选择。

还有DISTINCT ON另一种可能的查询技术,但是它几乎不会比原始查询快,因此不是您要寻找的答案。详细信息在这里:

指数

您的partial_idx

CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);

实际上不是部分索引,而且也是多余的。Postgres可以以几乎相同的速度向后扫描索引,PK的性能很好。删除此附加索引。

表格布局

表定义的几点。

CREATE TABLE all_units (
unit_timestamp timestamp,
unit_id int4,
lon     float4,
lat     float4,
speed   float4,
status  varchar(255),   -- might be improved.
PRIMARY KEY (unit_id, unit_timestamp)
);
  • timestamp(6)没什么意义,实际上与just相同timestamp,后者最多可以保存6个小数位。

  • 我切换了前两列的位置以节省4个字节的填充,对于25M行,其填充量约为100 MB(实际结果取决于status)。较小的表通常对所有内容都更快。

  • 如果status不是自由文本,而是某种标准化的注释,则可以用便宜得多的东西代替它。有关varchar(255)Postgres的更多信息

服务器配置

您需要配置服务器。您的大多数设置似乎都是保守的默认设置。对于具有数百万行的安装,大约1 MB shared_bufferswork_mem似乎不足random_pare_cost = 4对于具有大量RAM的任何现代系统来说,这都是很高的。从手册和Postgres Wiki开始:


为了完整起见,你应该添加DISTINCT ON在查询技术列表mthod这个问题(当然,如果OP要展示的所有行,因为他宣称,而不仅仅是2列的代码所示)
ypercubeᵀᴹ

@ypercube:是的,是的。但这几乎不会更快。不是OP寻找的答案。
Erwin Brandstetter 2014年

您能否详细说明递归CTE如何更快(甚至:它如何首先解决问题-我从未考虑过递归CTE用于“最大”问题)
a_horse_with_no_name 2014年

1
@Mihai:不,它只有一个位图索引扫描(即它会“动态”生成位图索引)
a_horse_with_no_name 2014年

3
@ErwinBrandstetter好!首先,非常感谢您对此详尽的回答,它非常有帮助,我无法解释您的多少!:)其次,您的查询(第二个查询)非常好!现在查询时间约为0.2秒!!!:)我也尝试使用第一个查询,但是我没有设法解决我在内部查询的invalid reference to FROM-clause entry for table "u"WHERE子句中遇到的错误:抱歉,但是我本人对数据库不是很好,所以我没有压抑它,因为第二个查询已经帮了很多忙。
2014年
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.