Why is SQLAlchemy 25x slower inserting into SQLite than using the sqlite3 driver directly?



Why is this simple test case 25 times slower inserting 100,000 rows with SQLAlchemy than it is using the sqlite3 driver directly? I have seen similar slowdowns in real-world applications. Am I doing something wrong?

#!/usr/bin/env python
# Why is SQLAlchemy with SQLite so slow?
# Output from this program:
# SqlAlchemy: Total time for 100000 records 10.74 secs
# sqlite3:    Total time for 100000 records  0.40 secs


import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine 
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    engine  = create_engine(dbname, echo=False)
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
    DBSession.commit()
    print("SqlAlchemy: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy(100000)
    test_sqlite3(100000)

I have tried many variations (see http://pastebin.com/zCmzDraU ).

Answers:



The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple "inserts" of data. It includes that attributes assigned on objects are received through an attribute instrumentation system that tracks changes as they are made, and that all inserted rows are tracked in an identity map. As a result, for each row SQLAlchemy must retrieve its "last inserted id" if one was not already given, and it also involves scanning the rows to be inserted and sorting them for dependencies as needed. Objects are subject to a fair degree of bookkeeping to keep all of this running, which for a very large number of rows at once can mean an inordinate amount of time spent on large data structures, so it is best to chunk them.

Basically, the unit of work is a large degree of automation, in order to persist a complex object graph into a relational database with no explicit persistence code, and that automation has a price.
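The state tracking described above can be observed directly through sqlalchemy.inspect(). A minimal sketch, assuming SQLAlchemy 1.4+ and an in-memory SQLite database (separate from the benchmark code):

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

session = Session(engine)
c = Customer(name="NAME 0")
print(inspect(c).transient)    # object not yet known to any session
session.add(c)
print(inspect(c).pending)      # tracked by the unit of work, not yet flushed
session.flush()
print(inspect(c).persistent)   # row inserted; the identity map now holds it
print(c.id)                    # the "last inserted id" the ORM retrieved
```

Each of those state transitions is bookkeeping that a raw sqlite3 insert loop never performs.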

So the ORM is basically not intended for high-performance bulk inserts. This is the whole reason SQLAlchemy offers two separate libraries: if you look at http://docs.sqlalchemy.org/zh-CN/latest/index.html, you will see two distinct halves on the index page, one for the ORM and one for the Core. You cannot use SQLAlchemy effectively without understanding both.

For the use case of fast bulk inserts, SQLAlchemy provides the Core, the SQL generation and execution system that the ORM is built on top of. Using this system effectively, we can produce an INSERT that is competitive with the raw SQLite version. The script below illustrates this, as well as an ORM version that pre-assigns primary key identifiers so that the ORM can use executemany() to insert the rows. Both ORM versions also chunk the flushes into batches of 1000 at a time, which has a significant effect on performance.

The runtimes observed here are:

SqlAlchemy ORM: Total time for 100000 records 16.4133379459 secs
SqlAlchemy ORM pk given: Total time for 100000 records 9.77570986748 secs
SqlAlchemy Core: Total time for 100000 records 0.568737983704 secs
sqlite3: Total time for 100000 records 0.595796823502 sec

Script:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name":'NAME ' + str(i)} for i in range(n)]
    )
    print("SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

See also: http://docs.sqlalchemy.org/en/latest/faq/performance.html


Thanks for the explanation. Is engine.execute() significantly different from DBSession.execute()? I had tried an insert expression using DBSession.execute(), but it was not significantly faster than the full ORM version.
braddock

engine.execute() and DBSession.execute() are mostly the same, except DBSession.execute() will wrap a plain SQL string in text(). It makes a huge difference if you use execute/executemany syntax. pysqlite is written entirely in C and has almost no latency, so any Python overhead added to its execute() call will show up prominently in profiling. Even a single pure-Python function call is significantly slower than a pure-C function call like pysqlite's execute(). You also need to take into account that SQLAlchemy expression constructs go through a compilation step on each execute() call.
zzzeek 2012
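The per-call overhead described in this comment can be seen with the stdlib driver alone. A stdlib-only sketch (in-memory database; absolute timings vary by machine):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")

n = 100000
rows = [("NAME " + str(i),) for i in range(n)]

# One Python-level execute() call per row.
t0 = time.time()
cur = conn.cursor()
for row in rows:
    cur.execute("INSERT INTO customer (name) VALUES (?)", row)
conn.commit()
print("execute loop:  %.3f secs" % (time.time() - t0))

conn.execute("DELETE FROM customer")

# A single executemany() call; the iteration over rows happens in C.
t0 = time.time()
conn.executemany("INSERT INTO customer (name) VALUES (?)", rows)
conn.commit()
print("executemany(): %.3f secs" % (time.time() - t0))

count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

This is the same executemany() path the "SqlAlchemy Core" and "pk given" variants above take advantage of.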

The Core was created first, although after the first few weeks, once the Core proof of concept worked (and it was terrible), the ORM and the Core were developed in parallel from that point on.
zzzeek

I really don't understand why anyone would choose the ORM model. Most projects that use a database will have 10,000+ rows. Maintaining two update methods (one for single rows and one for bulk) just doesn't sound wise.
Peter Moore

There will be .... 10,000 rows that all need to be bulk-inserted at once, all the time? Not particularly. The vast majority of web applications, for example, probably exchange half a dozen rows per request. The ORM is very popular with some very well-known and high-traffic websites.
zzzeek 2014


Excellent answer from @zzzeek. For those wondering about the same statistics for queries, I have modified @zzzeek's code slightly to query those same records right after inserting them and then convert them into a list of dicts.

Here are the results:

SqlAlchemy ORM: Total time for 100000 records 11.9210000038 secs
SqlAlchemy ORM query: Total time for 100000 records 2.94099998474 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.51800012589 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 3.07699990273 secs
SqlAlchemy Core: Total time for 100000 records 0.431999921799 secs
SqlAlchemy Core query: Total time for 100000 records 0.389000177383 secs
sqlite3: Total time for 100000 records 0.459000110626 sec
sqlite3 query: Total time for 100000 records 0.103999853134 secs

Interesting to note that querying with bare sqlite3 is still about 3 times faster than with SQLAlchemy Core. I guess that's the price you pay for having a ResultProxy returned instead of a bare sqlite3 row.

SQLAlchemy Core is about 8 times faster than the ORM, so querying with the ORM is a lot slower no matter what.
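On the raw sqlite3 side, part of the dict-conversion cost can be avoided with the stdlib row factory. A minimal stdlib-only sketch (tiny toy table, separate from the benchmark code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become accessible by column name
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")
conn.executemany("INSERT INTO customer (name) VALUES (?)",
                 [("NAME " + str(i),) for i in range(3)])

rows = conn.execute("SELECT id, name FROM customer").fetchall()
dicts = [dict(r) for r in rows]  # sqlite3.Row converts directly to a dict
print(dicts[0])
```

sqlite3.Row is implemented in C, so the name lookups stay cheap compared to building each dict by hand.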

Here is the code I used:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.sql import select

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    t0 = time.time()
    q = DBSession.query(Customer)
    rows = [{'id': r.id, 'name': r.name} for r in q]
    print("SqlAlchemy ORM query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print("SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    t0 = time.time()
    q = DBSession.query(Customer)
    rows = [{'id': r.id, 'name': r.name} for r in q]
    print("SqlAlchemy ORM pk given query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name":'NAME ' + str(i)} for i in range(n)]
    )
    print("SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs")
    conn = engine.connect()
    t0 = time.time()
    sql = select([Customer.__table__])
    q = conn.execute(sql)
    rows = [{'id': r[0], 'name': r[1]} for r in q]
    print("SqlAlchemy Core query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print("sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec")
    t0 = time.time()
    q = conn.execute("SELECT * FROM customer").fetchall()
    rows = [{'id': r[0], 'name': r[1]} for r in q]
    print("sqlite3 query: Total time for " + str(len(rows)) + " records " + str(time.time() - t0) + " secs")


if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

I also ran the tests without converting the query results to dicts, and the statistics are similar:

SqlAlchemy ORM: Total time for 100000 records 11.9189999104 secs
SqlAlchemy ORM query: Total time for 100000 records 2.78500008583 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.67199993134 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 2.94000005722 secs
SqlAlchemy Core: Total time for 100000 records 0.43700003624 secs
SqlAlchemy Core query: Total time for 100000 records 0.131000041962 secs
sqlite3: Total time for 100000 records 0.500999927521 sec
sqlite3 query: Total time for 100000 records 0.0859999656677 secs

Querying with SQLAlchemy Core is about 20 times faster than with the ORM.

It is important to note that these tests are very superficial and should not be taken too seriously. I might be missing some obvious tricks that could change the numbers completely.

The best way to measure performance improvements is directly in your own application. Don't take my statistics for granted.
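For measuring a piece of your own application, a minimal stdlib-only sketch using timeit (the insert_batch function is made up for illustration; n is small so it runs quickly):

```python
import sqlite3
import timeit

def insert_batch(n=1000):
    """Toy workload to be timed: create a table and bulk-insert n rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(255))")
    conn.executemany("INSERT INTO customer (name) VALUES (?)",
                     (("NAME " + str(i),) for i in range(n)))
    conn.commit()
    conn.close()

# Run the workload 10 times, repeat that 5 times, and report the minimum,
# which is the least noisy of the repeats.
best = min(timeit.repeat(insert_batch, number=10, repeat=5))
print("10 x insert_batch(1000): %.3f secs" % best)
```

Swapping insert_batch for your own session or Core code gives comparable numbers for your actual workload.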


Just to let you know, in 2019, with the latest versions of everything, I don't observe significant relative deviations from your timings. Still, I am also curious whether some "trick" was missed.
PascalVKooten


I would try the insert expression test and then benchmark it.

It will probably still be slower because of the OR-mapper overhead, but I would hope not that much slower.

Would you mind trying it and posting the results? This is very interesting stuff.
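A sketch of what that insert-expression test might look like, assuming SQLAlchemy 1.4+ (as the follow-up comment reports, braddock measured this only about 10% faster than the full ORM loop):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

session = Session(engine)
# Insert expression executed through the session: one executemany() call,
# bypassing the unit of work entirely.
session.execute(
    Customer.__table__.insert(),
    [{"name": "NAME " + str(i)} for i in range(1000)]
)
session.commit()

count = session.query(Customer).count()
print(count)
```

The remaining gap relative to raw sqlite3 comes from the expression-compilation and session plumbing described in zzzeek's comment above.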


Only 10% faster using the insert expression. I wish I knew why: SqlAlchemy Insert: Total time for 100000 records 9.47 secs
braddock 2012

Not trying to nag you, but if you're interested, maybe time the session-related code right after the inserts using timeit: docs.python.org/library/timeit.html
Edmon

I have the same problem with the insert expression, it's dead slow, see stackoverflow.com/questions/11887895/...
dorvak