How to import data from MongoDB to pandas?

97

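I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.

Sample data: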

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

— Nithin

Using a custom field type with MongoEngine can make storing and retrieving pandas DataFrames as simple as mongo_doc.data_frame = my_pandas_df

— Jthorpe

131

pymongo might give you a hand. The following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    # (query defaults to None rather than {} to avoid a shared mutable default)
    cursor = db[collection].find(query or {})

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the query matches nothing)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

— waitingkuo

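Thanks, this is the method I ended up using. I also have an array of embedded documents in each row, so I had to iterate over that as well within each row. Is there a better way?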

— Nithin

Is it possible to provide some samples of your MongoDB structure?

— waitingkuo

3

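Note that the list() inside df = pd.DataFrame(list(cursor)) evaluates as a list or generator, to keep the CPU cool. If you have a zillion data items, and the next few lines would have reasonably partitioned, level-of-detailed, and clipped them, the whole shmegegge is still safe to drop in. Nice.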

— Phlip

2

df = pd.DataFrame(list(cursor)) is very slow. The pure database query is much faster. Could we change the list cast to something else?

— Peter.k

1

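@Peter that line also caught my eye. Casting a database cursor, which is designed to be iterable and potentially wraps a large amount of data, into an in-memory list does not seem clever to me.

— Rafa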

39

You can load your MongoDB data into a pandas DataFrame using this code. It works for me. Hopefully it works for you too.

import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))

— saimadhu.polamuri

24

Monary does exactly that, and it's super fast. (another link)

See this cool post, which includes a quick tutorial and some timings.

— shx2

Does Monary support the string data type?

— Snehal Parmar

I tried Monary, but it is taking a lot of time. Am I missing some optimization? I tried

client = Monary(host, 27017, database="db_tmp")
columns = ["col1", "col2"]
data_type = ["int64", "int64"]
arrays = client.query("db_tmp", "coll", {}, columns, data_type)

For 50000 records it takes around 200s.

— nishant '17

That sounds extremely slow... Frankly, four years later, I don't know what the status of this project is...

— shx2

16

As per PEP 20, simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

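You can include conditions as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc.

And voilà!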
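To make the "include conditions" part concrete, here is a minimal sketch, assuming a collection object that behaves like a pymongo Collection. The helper name find_to_frame and its query/fields parameters are made up for illustration. It passes a filter and a projection to find(), so only the needed fields travel from the server.

```python
import pandas as pd


def find_to_frame(collection, query=None, fields=None):
    """Run collection.find(query, projection) and load the result into a DataFrame.

    `collection` is assumed to behave like a pymongo Collection;
    `find_to_frame`, `query` and `fields` are illustrative names.
    """
    projection = None
    if fields:
        # Include only the requested fields and leave out _id
        projection = {name: 1 for name in fields}
        projection["_id"] = 0
    cursor = collection.find(query or {}, projection)
    return pd.DataFrame.from_records(list(cursor))
```

With a real connection this might be called as find_to_frame(db.collection_name, {"reportCount": {"$gt": 6}}, ["sensorName", "reportCount"]).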

— Cy Bu

pd.DataFrame.from_records seems to be as slow as DataFrame(list()), but the results are very inconsistent. %%time showed anything from 800 ms to 1.9 s

— AFD

1

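This isn't good for huge numbers of records: it doesn't raise a memory error, and instead hangs the system when the data is too big, whereas pd.DataFrame(list(cursor)) raises a memory error.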

— Amulya Acharya

13

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

— wt

9

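To deal efficiently with out-of-core data (data that does not fit into RAM), i.e. with parallel execution, you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

And an article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.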

— Dennis Golomazov

1

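I've also written a post using Jupyter Notebook with an example of how Dask helps to speed up execution, even on data that fits into memory, by using multiple cores on a single machine.

— Dennis Golomazov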

8

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

This way you get the unfolding of nested MongoDB documents for free.
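For a document shaped like the sample in the question, where each report embeds a Readings array, a minimal sketch could use the record_path and meta arguments to get one row per reading. This assumes a pandas version that exposes pd.json_normalize at the top level (1.0 or later). The docs list below is a hand-written stand-in for list(my_collection.find()), trimmed to two readings.

```python
from datetime import datetime

import pandas as pd

# Stand-in for list(my_collection.find()): one document shaped like the
# question's sample, trimmed to two readings (pymongo returns plain dicts).
docs = [{
    "sensorName": "56847890-0",
    "reportCount": 8,
    "Readings": [
        {"a": 0.958069536790466, "b": 6.296118156595,
         "ReadingUpdatedDate": datetime(2013, 4, 2, 8, 26, 35, 297000)},
        {"a": 0.95574014778624, "b": 6.29651468650064,
         "ReadingUpdatedDate": datetime(2013, 4, 2, 8, 27, 9, 963000)},
    ],
}]

# One row per embedded reading; sensorName is repeated on every row.
readings = pd.json_normalize(docs, record_path="Readings", meta=["sensorName"])

# Index by timestamp so the readings behave like a time series.
readings = readings.set_index("ReadingUpdatedDate")
```

This avoids looping over the embedded documents by hand, which is the iteration the question's author asked about in the comments.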

— Ikar Pohorský

2

I got an error with this method: TypeError: data argument can't be an iterator

— Gabriel Fair

2

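Strange, this works on my Python 3.6.7 using pandas 0.24.2. Maybe you can try df = json_normalize(list(cursor)) instead?

— Ikar Pohorský

+1. According to the docs, the max_level argument defines the maximum level of dict depth. I just ran a test and that is not the case, so some columns would need to be split with .str accessors. Still, a very nice feature for working with MongoDB.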

— Mauricio Maroto

5

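Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end: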

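import pandas as pd

def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  if records:
    frames.append(pd.DataFrame(records))
  # pd.concat raises ValueError on an empty list, so return an empty frame instead
  if not frames:
    return pd.DataFrame()
  # ignore_index gives the combined frame one continuous row index
  return pd.concat(frames, ignore_index=True)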

— Deu Leung

3

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records()
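A minimal sketch of the JSON route, assuming the export has one JSON document per line, which is what mongoexport writes by default. The command in the comment and the database, collection and file names are hypothetical. An in-memory buffer stands in for the exported file so the snippet runs without a database, and pd.read_json is used here rather than from_records because it parses newline-delimited JSON directly.

```python
import io

import pandas as pd

# Stand-in for a file written by something like
#   mongoexport --db=mydb --collection=reports --out=reports.json
# (database, collection and file names are hypothetical).
exported = io.StringIO(
    '{"sensorName": "56847890-0", "reportCount": 8}\n'
    '{"sensorName": "56847890-1", "reportCount": 3}\n'
)

# lines=True tells pandas to parse newline-delimited JSON;
# with a real export, pass the file path instead of the buffer.
df = pd.read_json(exported, lines=True)
```

Keep in mind that a real export uses MongoDB Extended JSON, so ObjectId and date values arrive as nested objects such as {"$oid": ...} and {"$date": ...} and usually need some post-processing.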

— Jeff

2

It's DataFrame.from_records().

— Morten

1

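Following this great answer by waitingkuo, I would like to add the possibility of doing this with a chunksize, in line with .read_sql() and .read_csv(). I extend Deu Leung's answer by avoiding going one by one through each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.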

import pandas as pd
from pymongo import MongoClient

def read_mongo(db,
               collection, query=None,
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    #db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    client = MongoClient(host=host, port=port)
    # Select the specific DB and Collection
    coll = client[db][collection]
    query = query or {}

    # Number of matching documents; count_documents() replaces
    # cursor.count(), which was removed in PyMongo 4
    total = coll.count_documents(query)
    chunksize = int(chunksize)

    # Iteration to create the dataframe in chunks.
    frames = []
    for skip in range(0, total, chunksize):
        # Sort on _id so that skip/limit pages do not overlap,
        # then expand one page of the cursor into a DataFrame
        page = coll.find(query).sort('_id', 1).skip(skip).limit(chunksize)
        df_aux = pd.DataFrame(list(page))

        if no_id and '_id' in df_aux.columns:
            del df_aux['_id']
        frames.append(df_aux)

    # Concatenate the chunks into a single df (empty if nothing matched)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

— Rafael Valero

1

A similar approach using pagination, like Rafael Valero, waitingkuo and Deu Leung:

def read_mongo(
       db,
       collection, query=None,
       host='localhost', port=27017, username=None, password=None,
       chunksize = 100, page_num=1, no_id=True):

    # Connect to MongoDB (_connect_mongo is the helper from waitingkuo's answer)
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Avoid a mutable default argument. Sorry, this is in spanish
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the page is empty)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

— Jordy Cuan

0

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

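If your data is very large, you can run an aggregate query first to filter out the data you do not want, then map the rest to your desired columns.

Here is an example that maps Readings.a to column a and filters by the reportCount column: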

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo aggregate
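If pdmongo is not available, roughly the same result can presumably be had from pymongo's own Collection.aggregate. The sketch below assumes a collection object that behaves like a pymongo Collection, and aggregate_to_frame is a made-up helper name. The pipeline is the same one as in the example above.

```python
import pandas as pd

# Filter reports, emit one document per reading, and keep the
# reading's "a" value (plus _id, which $project keeps by default).
pipeline = [
    {"$match": {"reportCount": {"$gt": 6}}},
    {"$unwind": "$Readings"},
    {"$project": {"a": "$Readings.a"}},
]


def aggregate_to_frame(collection, pipeline):
    """Run an aggregation and load the resulting documents into a DataFrame.

    `collection` is assumed to behave like a pymongo Collection;
    `aggregate_to_frame` is an illustrative name.
    """
    return pd.DataFrame(list(collection.aggregate(pipeline)))
```

With a real connection, collection would be something like MongoClient("mongodb://localhost:27017")["mydb"]["MyCollection"].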

— pakallis