Questions tagged «pyarrow»

1
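What are the differences between Feather and Parquet?
Both are columnar (on-disk) storage formats for use in data analysis systems. Both are integrated into Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do the two formats differ? Should you always prefer Feather when working with pandas, where possible? In which use cases is Feather more suitable than Parquet, and vice versa?

Appendix: I found some hints at https://github.com/wesm/feather/issues/188, but given how young this project is, they may be somewhat out of date. The following is not a rigorous speed test, because I am only dumping and loading a whole DataFrame, but it should give you an impression if you have never heard of these formats before:

    # IPython
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq
    import fastparquet as fp

    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
    print("pandas df …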
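Since the excerpt's own benchmark is cut off, here is a separate, minimal sketch of the round trip the question describes: writing one DataFrame to both formats and reading it back. It is not the asker's code. The helper name `compare_formats` and the file names are hypothetical; the pyarrow calls used are `feather.write_feather`, `feather.read_feather`, `pq.write_table`, `pq.read_table` and `pa.Table.from_pandas`.

```python
import os
import tempfile


def compare_formats(df):
    """Round-trip df through Feather and Parquet.

    Returns {format_name: (bytes_on_disk, restored_dataframe)}.
    Hypothetical helper for illustration only.
    """
    # Imported here so the module still loads where pyarrow is not installed.
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Feather: write and read the DataFrame directly.
        feather_path = os.path.join(tmp, "df.feather")
        feather.write_feather(df, feather_path)
        results["feather"] = (
            os.path.getsize(feather_path),
            feather.read_feather(feather_path),
        )

        # Parquet: convert to an Arrow table first, then write it.
        parquet_path = os.path.join(tmp, "df.parquet")
        pq.write_table(pa.Table.from_pandas(df), parquet_path)
        results["parquet"] = (
            os.path.getsize(parquet_path),
            pq.read_table(parquet_path).to_pandas(),
        )
    return results
```

Calling `compare_formats(df)` on the three-column frame from the question returns the on-disk size and the restored frame for each format. On a frame this small the sizes say little about real workloads, so treat the numbers as a smoke test rather than a benchmark.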

1
Pandas UDF and pyarrow 0.15.0
Recently I started getting a bunch of errors in a number of pyspark jobs running on EMR clusters. The errors are:

    java.lang.IllegalArgumentException
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
        at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
        at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
        at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
        at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
        at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)...

They all seem to happen in apply functions on a pandas Series. The only change I found is that pyarrow was updated on Saturday (05/10/2019). The tests seem to work with 0.14.1. So my question is: does anyone know whether this is a bug in the newly updated pyarrow, or is there some significant change that will make pandas UDFs hard to use in the future?
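For context, Arrow 0.15.0 changed its IPC stream format, and the Spark documentation describes a compatibility setting for Spark 2.3.x and 2.4.x: the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1`. The sketch below shows one way that decision could be encoded. The helper names are hypothetical, the version strings are passed in by hand, and the Spark version of the asker's cluster is not stated in the question, so this is an illustration rather than a confirmed fix for this trace.

```python
import os


def parse_version(version):
    """Turn a string like '0.15.0' into (0, 15, 0).

    Non-numeric suffixes within a component are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_legacy_ipc(pyarrow_version, spark_version):
    """True when pyarrow >= 0.15.0 is paired with Spark older than 3.0."""
    return (
        parse_version(pyarrow_version) >= (0, 15, 0)
        and parse_version(spark_version) < (3, 0, 0)
    )


def configure_legacy_ipc(pyarrow_version, spark_version, env=None):
    """Set ARROW_PRE_0_15_IPC_FORMAT=1 in env when the pairing needs it.

    env defaults to os.environ; pass a dict to try it without side effects.
    Returns True if the variable was set.
    """
    env = os.environ if env is None else env
    if needs_legacy_ipc(pyarrow_version, spark_version):
        env["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return True
    return False
```

Setting the variable only in the driver process is not enough, since the Python workers on the executors need it as well. That is usually done through `conf/spark-env.sh` or the `spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT` configuration key. Pinning pyarrow to 0.14.1, which the question reports as working, is the other obvious route.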
Licensed under cc by-sa 3.0 with attribution required.