\n",
" \n",
" 0 | \n",
- " BigQuery is a serverless, highly scalable, and... | \n",
- " [{\"category\":1,\"probability\":1,\"probability_sc... | \n",
+ " ## BigQuery: Your Data Warehouse in the Cloud\n",
+ "... | \n",
+ " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
" | \n",
" What is BigQuery? | \n",
"
\n",
" \n",
" 1 | \n",
- " ## BQML: Bringing Quantum Machine Learning to ... | \n",
- " [{\"category\":1,\"probability\":1,\"probability_sc... | \n",
+ " ## BQML - BigQuery Machine Learning\n",
+ "\n",
+ "BQML stan... | \n",
+ " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
" | \n",
" What is BQML? | \n",
"
\n",
@@ -1819,7 +1326,7 @@
" ## BigQuery DataFrames\n",
"\n",
"BigQuery DataFrames is... | \n",
- " [{\"category\":1,\"probability\":1,\"probability_sc... | \n",
+ " [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... | \n",
" | \n",
" What is BigQuery DataFrames? | \n",
" \n",
@@ -1830,16 +1337,19 @@
],
"text/plain": [
" ml_generate_text_llm_result \\\n",
- "0 BigQuery is a serverless, highly scalable, and... \n",
- "1 ## BQML: Bringing Quantum Machine Learning to ... \n",
+ "0 ## BigQuery: Your Data Warehouse in the Cloud\n",
+ "... \n",
+ "1 ## BQML - BigQuery Machine Learning\n",
+ "\n",
+ "BQML stan... \n",
"2 ## BigQuery DataFrames\n",
"\n",
"BigQuery DataFrames is... \n",
"\n",
" ml_generate_text_rai_result ml_generate_text_status \\\n",
- "0 [{\"category\":1,\"probability\":1,\"probability_sc... \n",
- "1 [{\"category\":1,\"probability\":1,\"probability_sc... \n",
- "2 [{\"category\":1,\"probability\":1,\"probability_sc... \n",
+ "0 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
+ "1 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
+ "2 [{\"category\":\"HARM_CATEGORY_HATE_SPEECH\",\"prob... \n",
"\n",
" prompt \n",
"0 What is BigQuery? \n",
@@ -1849,18 +1359,18 @@
"[3 rows x 4 columns]"
]
},
- "execution_count": 45,
+ "execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "from bigframes.ml.llm import GeminiTextGenerator\n",
+ "# from bigframes.ml.llm import GeminiTextGenerator\n",
"\n",
- "model = GeminiTextGenerator()\n",
+ "# model = GeminiTextGenerator()\n",
"\n",
- "pred = model.predict(df)\n",
- "pred"
+ "# pred = model.predict(df)\n",
+ "# pred"
]
},
{
@@ -1872,66 +1382,49 @@
},
{
"cell_type": "code",
- "execution_count": 46,
+ "execution_count": 24,
"metadata": {},
"outputs": [
- {
- "data": {
- "text/html": [
- "Query job d5fed4ed-26c2-45a6-b842-3af8c901985c is DONE. 10.7 kB processed. Open Job"
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
{
"name": "stdout",
"output_type": "stream",
"text": [
"## BigQuery DataFrames\n",
"\n",
- "BigQuery DataFrames is an open-source project offered by Google that provides the capabilities of using pandas-style APIs directly in BigQuery's serverless environment for performing SQL and DDL queries. This essentially means you have the flexibility to write pandas code within BigQuery for data exploration, transformation, visualization and building machine learning models. It acts as an intermediary bridge that facilitates SQL queries to the BigQuery engine for running your analysis with speed and scalability on large datasets without requiring manual configuration. It can further extend its functionalities through third-party libraries like scikit-learn, matplotlib, seaborn etc., enhancing its versatility within the realm of data manipulation.\n",
- "\n",
- "Here's are some key benefits associated with BigQuery DataFrames:\n",
- "\n",
- "\n",
- "### Streamlined Experience:\n",
- "\n",
- "BigQuery DataFrames simplifies your development workflow by eliminating the back-and-forth communication between pandas and BigQuery environments for data operations. It allows working seamlessly within BigQuery to leverage powerful SQL features while using familiar pandas functions on data stored within. This enables a smoother process from ingesting, analyzing, and visualizing your data efficiently.\n",
- "\n",
- "### Serverless Infrastructure:\n",
- "\n",
- "One of the greatest advantages of BigQuery DataFrames is that it runs on serverless infrastructure, eliminating the need to maintain complex environments for development. This translates to less complexity, easy management, and a focus on efficient analysis rather than infrastructure upkeep.\n",
- "\n",
- "### Scalable Capabilities:\n",
- "\n",
- "As mentioned, BigQuery excels in dealing with immense data sets with efficient storage and processing power due to its architecture built for handling petabyte-scale datasets in Google Cloud Storage. DataFrames inherits this strength, empowering the analysis of vast information while ensuring speed and reliability throughout.\n",
+ "BigQuery DataFrames is a Python library that allows you to interact with BigQuery data using the familiar Pandas API. This means you can use all the powerful tools and methods from the Pandas library to explore, analyze, and transform your BigQuery data, without needing to learn a new language or API.\n",
"\n",
- "### Open-source Ecosystem:\n",
+ "Here are some of the key benefits of using BigQuery DataFrames:\n",
"\n",
- "While Google spearheads its initial creation, DataFrames benefits immensely from its open-source structure. This fosters community-wide involvement in its advancement; developers are continually making contributions that bolster functionalities, introduce improvements with regular updates and fixes.\n",
+ "* **Ease of use:** If you're already familiar with Pandas, you can start using BigQuery DataFrames with minimal learning curve.\n",
+ "* **Speed and efficiency:** BigQuery DataFrames leverages the power of BigQuery to perform complex operations on large datasets efficiently.\n",
+ "* **Flexibility:** You can use BigQuery DataFrames for a wide range of tasks, including data exploration, analysis, cleaning, and transformation.\n",
+ "* **Integration with other tools:** BigQuery DataFrames integrates seamlessly with other Google Cloud tools like Colab and Vertex AI, allowing you to build end-to-end data analysis pipelines.\n",
"\n",
- "Here are a few scenarios where using BigQuery DataFrames might prove particularly valuable:\n",
+ "Here are some of the key features of BigQuery DataFrames:\n",
"\n",
- "- Performing exploratory analysis on a diverse range of dataset directly on serverless infrastructure with scalability, saving valuable operational cost and time.\n",
+ "* **Support for most Pandas operations:** You can use most of the DataFrame methods you're familiar with, such as `groupby`, `filter`, `sort_values`, and `apply`.\n",
+ "* **Automatic schema inference:** BigQuery DataFrames automatically infers the schema of your data, so you don't need to manually specify it.\n",
+ "* **Efficient handling of large datasets:** BigQuery DataFrames pushes computations to BigQuery, which allows you to work with large datasets without running out of memory.\n",
+ "* **Support for both public and private datasets:** You can use BigQuery DataFrames to access both public and private datasets stored in BigQuery.\n",
"\n",
- "- Implementing data preprocessing steps using Python and DataFrames within Google Cloud Platform without having to transfer and analyze it elsewhere, streamlining workflow within the same framework.\n",
+ "## Getting Started with BigQuery DataFrames\n",
"\n",
- "- When building and training ML models directly from datasets without exporting the data outside, maintaining security within and improving efficiency.\n",
+ "Getting started with BigQuery DataFrames is easy. You just need to install the library and configure your authentication. Once you're set up, you can start using it to interact with your BigQuery data.\n",
"\n",
+ "Here are some resources to help you get started:\n",
"\n",
- "However, be aware that DataFrames is an ever-evolving project and some aspects such as DML functionalities remain under active development to reach feature completion as compared to standard SQL commands which have matured functionality already in place within DataFrames.\n",
+ "* **Documentation:** https://cloud.google.com/bigquery/docs/reference/libraries/bigquery-dataframe\n",
+ "* **Quickstart:** https://cloud.google.com/bigquery/docs/reference/libraries/bigquery-dataframe-python-quickstart\n",
+ "* **Tutorials:** https://cloud.google.com/bigquery/docs/tutorials/bq-dataframe-pandas-tutorial\n",
"\n",
+ "## Conclusion\n",
"\n",
- "Would you like me to delve deeper into specific features of DataFrames, its current limitations or perhaps provide examples of its applications or user cases?\n"
+ "BigQuery DataFrames is a powerful tool that can help you get the most out of your BigQuery data. If you're looking for a way to easily analyze and transform your BigQuery data using the familiar Pandas API, then BigQuery DataFrames is a great option.\n"
]
}
],
"source": [
- "print(pred.loc[2][\"ml_generate_text_llm_result\"])"
+ "# print(pred.loc[2][\"ml_generate_text_llm_result\"])"
]
},
{
@@ -1998,7 +1491,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.6"
+ "version": "3.10.12"
}
},
"nbformat": 4,
diff --git a/setup.py b/setup.py
index 74a0d5475c..047da2348c 100644
--- a/setup.py
+++ b/setup.py
@@ -62,6 +62,7 @@
"ipywidgets >=7.7.1",
"humanize >=4.6.0",
"matplotlib >=3.7.1",
+ "db-dtypes >=1.4.0",
# For vendored ibis-framework.
"atpublic>=2.3,<6",
"parsy>=2,<3",
diff --git a/testing/constraints-3.9.txt b/testing/constraints-3.9.txt
index 015153cb01..8b7ad892c0 100644
--- a/testing/constraints-3.9.txt
+++ b/testing/constraints-3.9.txt
@@ -26,6 +26,7 @@ tabulate==0.9
ipywidgets==7.7.1
humanize==4.6.0
matplotlib==3.7.1
+db-dtypes==1.4.0
# For vendored ibis-framework.
atpublic==2.3
parsy==2.0
diff --git a/tests/system/large/test_remote_function.py b/tests/system/large/test_remote_function.py
index d0eb6c1904..f226143b50 100644
--- a/tests/system/large/test_remote_function.py
+++ b/tests/system/large/test_remote_function.py
@@ -32,7 +32,7 @@
import bigframes.dataframe
import bigframes.dtypes
import bigframes.exceptions
-import bigframes.functions._utils as functions_utils
+import bigframes.functions._utils as bff_utils
import bigframes.pandas as bpd
import bigframes.series
from tests.system.utils import (
@@ -633,11 +633,9 @@ def add_one(x):
add_one_uniq, add_one_uniq_dir = make_uniq_udf(add_one)
# Expected cloud function name for the unique udf
- package_requirements = functions_utils._get_updated_package_requirements()
- add_one_uniq_hash = functions_utils._get_hash(
- add_one_uniq, package_requirements
- )
- add_one_uniq_cf_name = functions_utils.get_cloud_function_name(
+ package_requirements = bff_utils._get_updated_package_requirements()
+ add_one_uniq_hash = bff_utils._get_hash(add_one_uniq, package_requirements)
+ add_one_uniq_cf_name = bff_utils.get_cloud_function_name(
add_one_uniq_hash, session.session_id
)
diff --git a/tests/system/small/bigquery/test_json.py b/tests/system/small/bigquery/test_json.py
index b01ac3aaf2..aa490749ae 100644
--- a/tests/system/small/bigquery/test_json.py
+++ b/tests/system/small/bigquery/test_json.py
@@ -118,7 +118,6 @@ def test_json_set_w_invalid_series_type():
def test_json_extract_from_json():
s = _get_series_from_json([{"a": {"b": [1, 2]}}, {"a": {"c": 1}}, {"a": {"b": 0}}])
actual = bbq.json_extract(s, "$.a.b").to_pandas()
- # After the introduction of the JSON type, the output should be a JSON-formatted series.
expected = _get_series_from_json([[1, 2], None, 0]).to_pandas()
pd.testing.assert_series_equal(
actual,
@@ -129,12 +128,10 @@ def test_json_extract_from_json():
def test_json_extract_from_string():
s = bpd.Series(['{"a": {"b": [1, 2]}}', '{"a": {"c": 1}}', '{"a": {"b": 0}}'])
actual = bbq.json_extract(s, "$.a.b")
- expected = _get_series_from_json([[1, 2], None, 0])
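+ # json_extract on STRING input yields JSON-formatted strings, not JSON values.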
+ expected = bpd.Series(["[1,2]", None, "0"])
pd.testing.assert_series_equal(
actual.to_pandas(),
expected.to_pandas(),
- check_names=False,
- check_dtype=False, # json_extract returns string type. While _get_series_from_json gives a JSON series (pa.large_string).
)
@@ -143,20 +140,58 @@ def test_json_extract_w_invalid_series_type():
bbq.json_extract(bpd.Series([1, 2]), "$.a")
+def test_json_extract_array_from_json():
+ s = _get_series_from_json(
+ [{"a": ["ab", "2", "3 xy"]}, {"a": []}, {"a": ["4", "5"]}, {}]
+ )
+ actual = bbq.json_extract_array(s, "$.a")
+
+ # This code provides a workaround for issue https://github.com/apache/arrow/issues/45262,
+ # which currently prevents constructing a series using pa.list_(db_dtypes.JSONArrowType()).
+ sql = """
+ SELECT 0 AS id, [JSON '"ab"', JSON '"2"', JSON '"3 xy"'] AS data,
+ UNION ALL
+ SELECT 1, [],
+ UNION ALL
+ SELECT 2, [JSON '"4"', JSON '"5"'],
+ UNION ALL
+ SELECT 3, null,
+ """
+ df = bpd.read_gbq(sql).set_index("id").sort_index()
+ expected = df["data"]
+
+ pd.testing.assert_series_equal(
+ actual.to_pandas(),
+ expected.to_pandas(),
+ )
+
+
def test_json_extract_array_from_json_strings():
- s = bpd.Series(['{"a": ["ab", "2", "3 xy"]}', '{"a": []}', '{"a": ["4","5"]}'])
+ s = bpd.Series(
+ ['{"a": ["ab", "2", "3 xy"]}', '{"a": []}', '{"a": ["4","5"]}', "{}"],
+ dtype=pd.StringDtype(storage="pyarrow"),
+ )
actual = bbq.json_extract_array(s, "$.a")
- expected = bpd.Series([['"ab"', '"2"', '"3 xy"'], [], ['"4"', '"5"']])
+ expected = bpd.Series(
+ [['"ab"', '"2"', '"3 xy"'], [], ['"4"', '"5"'], None],
+ dtype=pd.StringDtype(storage="pyarrow"),
+ )
pd.testing.assert_series_equal(
actual.to_pandas(),
expected.to_pandas(),
)
-def test_json_extract_array_from_array_strings():
- s = bpd.Series(["[1, 2, 3]", "[]", "[4,5]"])
+def test_json_extract_array_from_json_array_strings():
+ s = bpd.Series(
+ ["[1, 2, 3]", "[]", "[4,5]"],
+ dtype=pd.StringDtype(storage="pyarrow"),
+ )
actual = bbq.json_extract_array(s)
- expected = bpd.Series([["1", "2", "3"], [], ["4", "5"]])
+ expected = bpd.Series(
+ [["1", "2", "3"], [], ["4", "5"]],
+ dtype=pd.StringDtype(storage="pyarrow"),
+ )
pd.testing.assert_series_equal(
actual.to_pandas(),
expected.to_pandas(),
@@ -164,8 +199,9 @@ def test_json_extract_array_from_array_strings():
def test_json_extract_array_w_invalid_series_type():
+ s = bpd.Series([1, 2])
with pytest.raises(TypeError):
- bbq.json_extract_array(bpd.Series([1, 2]))
+ bbq.json_extract_array(s)
def test_json_extract_string_array_from_json_strings():
@@ -203,14 +239,6 @@ def test_json_extract_string_array_w_invalid_series_type():
bbq.json_extract_string_array(bpd.Series([1, 2]))
-# b/381148539
-def test_json_in_struct():
- df = bpd.read_gbq(
- "SELECT STRUCT(JSON '{\\\"a\\\": 1}' AS data, 1 AS number) as struct_col"
- )
- assert df["struct_col"].struct.field("data")[0] == '{"a":1}'
-
-
def test_parse_json_w_invalid_series_type():
with pytest.raises(TypeError):
bbq.parse_json(bpd.Series([1, 2]))
diff --git a/tests/system/small/operations/test_plotting.py b/tests/system/small/operations/test_plotting.py
index 3624232ea0..c2f3ba423f 100644
--- a/tests/system/small/operations/test_plotting.py
+++ b/tests/system/small/operations/test_plotting.py
@@ -34,10 +34,20 @@ def _check_legend_labels(ax, labels):
assert label == e
-def test_series_hist_bins(scalars_dfs):
+@pytest.mark.parametrize(
+ ("alias"),
+ [
+ pytest.param(True),
+ pytest.param(False),
+ ],
+)
+def test_series_hist_bins(scalars_dfs, alias):
scalars_df, scalars_pandas_df = scalars_dfs
bins = 5
- ax = scalars_df["int64_col"].plot.hist(bins=bins)
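+ # The hist() alias should behave identically to plot.hist().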
+ if alias:
+ ax = scalars_df["int64_col"].hist(bins=bins)
+ else:
+ ax = scalars_df["int64_col"].plot.hist(bins=bins)
pd_ax = scalars_pandas_df["int64_col"].plot.hist(bins=bins)
# Compares axis values and height between bigframes and pandas histograms.
@@ -49,11 +59,21 @@ def test_series_hist_bins(scalars_dfs):
assert ax.patches[i]._height == pd_ax.patches[i]._height
-def test_dataframes_hist_bins(scalars_dfs):
+@pytest.mark.parametrize(
+ ("alias"),
+ [
+ pytest.param(True),
+ pytest.param(False),
+ ],
+)
+def test_dataframes_hist_bins(scalars_dfs, alias):
scalars_df, scalars_pandas_df = scalars_dfs
bins = 7
columns = ["int64_col", "int64_too", "float64_col"]
- ax = scalars_df[columns].plot.hist(bins=bins)
+ if alias:
+ ax = scalars_df[columns].hist(bins=bins)
+ else:
+ ax = scalars_df[columns].plot.hist(bins=bins)
pd_ax = scalars_pandas_df[columns].plot.hist(bins=bins)
# Compares axis values and height between bigframes and pandas histograms.
@@ -171,10 +191,25 @@ def test_hist_kwargs_ticks_props(scalars_dfs):
tm.assert_almost_equal(ylabels[i].get_rotation(), pd_ylables[i].get_rotation())
-def test_line(scalars_dfs):
+@pytest.mark.parametrize(
+ ("col_names", "alias"),
+ [
+ pytest.param(
+ ["int64_col", "float64_col", "int64_too", "bool_col"], True, id="df_alias"
+ ),
+ pytest.param(
+ ["int64_col", "float64_col", "int64_too", "bool_col"], False, id="df"
+ ),
+ pytest.param(["int64_col"], True, id="series_alias"),
+ pytest.param(["int64_col"], False, id="series"),
+ ],
+)
+def test_line(scalars_dfs, col_names, alias):
scalars_df, scalars_pandas_df = scalars_dfs
- col_names = ["int64_col", "float64_col", "int64_too", "bool_col"]
- ax = scalars_df[col_names].plot.line()
+ if alias:
+ ax = scalars_df[col_names].line()
+ else:
+ ax = scalars_df[col_names].plot.line()
pd_ax = scalars_pandas_df[col_names].plot.line()
tm.assert_almost_equal(ax.get_xticks(), pd_ax.get_xticks())
tm.assert_almost_equal(ax.get_yticks(), pd_ax.get_yticks())
@@ -183,10 +218,21 @@ def test_line(scalars_dfs):
tm.assert_almost_equal(line.get_data()[1], pd_line.get_data()[1])
-def test_area(scalars_dfs):
+@pytest.mark.parametrize(
+ ("col_names", "alias"),
+ [
+ pytest.param(["int64_col", "float64_col", "int64_too"], True, id="df_alias"),
+ pytest.param(["int64_col", "float64_col", "int64_too"], False, id="df"),
+ pytest.param(["int64_col"], True, id="series_alias"),
+ pytest.param(["int64_col"], False, id="series"),
+ ],
+)
+def test_area(scalars_dfs, col_names, alias):
scalars_df, scalars_pandas_df = scalars_dfs
- col_names = ["int64_col", "float64_col", "int64_too"]
- ax = scalars_df[col_names].plot.area(stacked=False)
+ if alias:
+ ax = scalars_df[col_names].area(stacked=False)
+ else:
+ ax = scalars_df[col_names].plot.area(stacked=False)
pd_ax = scalars_pandas_df[col_names].plot.area(stacked=False)
tm.assert_almost_equal(ax.get_xticks(), pd_ax.get_xticks())
tm.assert_almost_equal(ax.get_yticks(), pd_ax.get_yticks())
@@ -195,10 +241,21 @@ def test_area(scalars_dfs):
tm.assert_almost_equal(line.get_data()[1], pd_line.get_data()[1])
-def test_bar(scalars_dfs):
+@pytest.mark.parametrize(
+ ("col_names", "alias"),
+ [
+ pytest.param(["int64_col", "float64_col", "int64_too"], True, id="df_alias"),
+ pytest.param(["int64_col", "float64_col", "int64_too"], False, id="df"),
+ pytest.param(["int64_col"], True, id="series_alias"),
+ pytest.param(["int64_col"], False, id="series"),
+ ],
+)
+def test_bar(scalars_dfs, col_names, alias):
scalars_df, scalars_pandas_df = scalars_dfs
- col_names = ["int64_col", "float64_col", "int64_too"]
- ax = scalars_df[col_names].plot.bar()
+ if alias:
+ ax = scalars_df[col_names].bar()
+ else:
+ ax = scalars_df[col_names].plot.bar()
pd_ax = scalars_pandas_df[col_names].plot.bar()
tm.assert_almost_equal(ax.get_xticks(), pd_ax.get_xticks())
tm.assert_almost_equal(ax.get_yticks(), pd_ax.get_yticks())
@@ -207,10 +264,23 @@ def test_bar(scalars_dfs):
tm.assert_almost_equal(line.get_data()[1], pd_line.get_data()[1])
-def test_scatter(scalars_dfs):
+@pytest.mark.parametrize(
+ ("col_names", "alias"),
+ [
+ pytest.param(
+ ["int64_col", "float64_col", "int64_too", "bool_col"], True, id="df_alias"
+ ),
+ pytest.param(
+ ["int64_col", "float64_col", "int64_too", "bool_col"], False, id="df"
+ ),
+ ],
+)
+def test_scatter(scalars_dfs, col_names, alias):
scalars_df, scalars_pandas_df = scalars_dfs
- col_names = ["int64_col", "float64_col", "int64_too", "bool_col"]
- ax = scalars_df[col_names].plot.scatter(x="int64_col", y="float64_col")
+ if alias:
+ ax = scalars_df[col_names].scatter(x="int64_col", y="float64_col")
+ else:
+ ax = scalars_df[col_names].plot.scatter(x="int64_col", y="float64_col")
pd_ax = scalars_pandas_df[col_names].plot.scatter(x="int64_col", y="float64_col")
tm.assert_almost_equal(ax.get_xticks(), pd_ax.get_xticks())
tm.assert_almost_equal(ax.get_yticks(), pd_ax.get_yticks())
diff --git a/tests/system/small/test_dataframe.py b/tests/system/small/test_dataframe.py
index e7d6ad67e1..4266cdba88 100644
--- a/tests/system/small/test_dataframe.py
+++ b/tests/system/small/test_dataframe.py
@@ -331,6 +331,17 @@ def test_where_series_cond(scalars_df_index, scalars_pandas_df_index):
pandas.testing.assert_frame_equal(bf_result, pd_result)
+def test_mask_series_cond(scalars_df_index, scalars_pandas_df_index):
+ cond_bf = scalars_df_index["int64_col"] > 0
+ cond_pd = scalars_pandas_df_index["int64_col"] > 0
+
+ bf_df = scalars_df_index[["int64_too", "int64_col", "float64_col"]]
+ pd_df = scalars_pandas_df_index[["int64_too", "int64_col", "float64_col"]]
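+ # mask() replaces values where the condition is True, mirroring pandas.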
+ bf_result = bf_df.mask(cond_bf, bf_df + 1).to_pandas()
+ pd_result = pd_df.mask(cond_pd, pd_df + 1)
+ pandas.testing.assert_frame_equal(bf_result, pd_result)
+
+
def test_where_series_multi_index(scalars_df_index, scalars_pandas_df_index):
# Test when a dataframe has multi-index or multi-columns.
columns = ["int64_col", "float64_col"]
@@ -2235,6 +2246,72 @@ def test_cov_w_numeric_only(scalars_dfs_maybe_ordered, columns, numeric_only):
)
+def test_df_corrwith_df(scalars_dfs_maybe_ordered):
+ scalars_df, scalars_pandas_df = scalars_dfs_maybe_ordered
+
+ l_cols = ["int64_col", "float64_col", "int64_too"]
+ r_cols = ["int64_too", "float64_col"]
+
+ bf_result = scalars_df[l_cols].corrwith(scalars_df[r_cols]).to_pandas()
+ pd_result = scalars_pandas_df[l_cols].corrwith(scalars_pandas_df[r_cols])
+
+ # BigFrames and Pandas differ in their data type handling:
+ # - Column types: BigFrames uses Float64, Pandas uses float64.
+ # - Index types: BigFrames uses strign, Pandas uses object.
+ pd.testing.assert_series_equal(
+ bf_result, pd_result, check_dtype=False, check_index_type=False
+ )
+
+
+def test_df_corrwith_df_numeric_only(scalars_dfs):
+ scalars_df, scalars_pandas_df = scalars_dfs
+
+ l_cols = ["int64_col", "float64_col", "int64_too", "string_col"]
+ r_cols = ["int64_too", "float64_col", "bool_col"]
+
+ bf_result = (
+ scalars_df[l_cols].corrwith(scalars_df[r_cols], numeric_only=True).to_pandas()
+ )
+ pd_result = scalars_pandas_df[l_cols].corrwith(
+ scalars_pandas_df[r_cols], numeric_only=True
+ )
+
+ # BigFrames and Pandas differ in their data type handling:
+ # - Column types: BigFrames uses Float64, Pandas uses float64.
+ # - Index types: BigFrames uses string, Pandas uses object.
+ pd.testing.assert_series_equal(
+ bf_result, pd_result, check_dtype=False, check_index_type=False
+ )
+
+
+def test_df_corrwith_df_non_numeric_error(scalars_dfs):
+ scalars_df, _ = scalars_dfs
+
+ l_cols = ["int64_col", "float64_col", "int64_too", "string_col"]
+ r_cols = ["int64_too", "float64_col", "bool_col"]
+
+ with pytest.raises(NotImplementedError):
+ scalars_df[l_cols].corrwith(scalars_df[r_cols], numeric_only=False)
+
+
+@skip_legacy_pandas
+def test_df_corrwith_series(scalars_dfs_maybe_ordered):
+ scalars_df, scalars_pandas_df = scalars_dfs_maybe_ordered
+
+ l_cols = ["int64_col", "float64_col", "int64_too"]
+ r_col = "float64_col"
+
+ bf_result = scalars_df[l_cols].corrwith(scalars_df[r_col]).to_pandas()
+ pd_result = scalars_pandas_df[l_cols].corrwith(scalars_pandas_df[r_col])
+
+ # BigFrames and Pandas differ in their data type handling:
+ # - Column types: BigFrames uses Float64, Pandas uses float64.
+ # - Index types: BigFrames uses string, Pandas uses object.
+ pd.testing.assert_series_equal(
+ bf_result, pd_result, check_dtype=False, check_index_type=False
+ )
+
+
@pytest.mark.parametrize(
("op"),
[
diff --git a/tests/system/small/test_dataframe_io.py b/tests/system/small/test_dataframe_io.py
index 848e21f6bd..10637b2395 100644
--- a/tests/system/small/test_dataframe_io.py
+++ b/tests/system/small/test_dataframe_io.py
@@ -12,8 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+import math
from typing import Tuple
+import db_dtypes # type:ignore
import google.api_core.exceptions
import pandas as pd
import pandas.testing
@@ -247,23 +249,146 @@ def test_to_pandas_array_struct_correct_result(session):
)
-def test_load_json(session):
- df = session.read_gbq(
- """SELECT
- JSON_OBJECT('foo', 10, 'bar', TRUE) AS json_column
- """
- )
-
+def test_load_json_w_unboxed_py_value(session):
+ sql = """
+ SELECT 0 AS id, JSON_OBJECT('boolean', True) AS json_col,
+ UNION ALL
+ SELECT 1, JSON_OBJECT('int', 100),
+ UNION ALL
+ SELECT 2, JSON_OBJECT('float', 0.98),
+ UNION ALL
+ SELECT 3, JSON_OBJECT('string', 'hello world'),
+ UNION ALL
+ SELECT 4, JSON_OBJECT('array', [8, 9, 10]),
+ UNION ALL
+ SELECT 5, JSON_OBJECT('null', null),
+ UNION ALL
+ SELECT
+ 6,
+ JSON_OBJECT(
+ 'dict',
+ JSON_OBJECT(
+ 'int', 1,
+ 'array', [JSON_OBJECT('bar', 'hello'), JSON_OBJECT('foo', 1)]
+ )
+ ),
+ """
+ df = session.read_gbq(sql, index_col="id")
+
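+ # JSON scalars should unbox to native Python values (bool, int, float, str, list, dict).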
+ assert df.dtypes["json_col"] == db_dtypes.JSONDtype()
+ assert isinstance(df["json_col"][0], dict)
+
+ assert df["json_col"][0]["boolean"]
+ assert df["json_col"][1]["int"] == 100
+ assert math.isclose(df["json_col"][2]["float"], 0.98)
+ assert df["json_col"][3]["string"] == "hello world"
+ assert df["json_col"][4]["array"] == [8, 9, 10]
+ assert df["json_col"][5]["null"] is None
+ assert df["json_col"][6]["dict"] == {
+ "int": 1,
+ "array": [{"bar": "hello"}, {"foo": 1}],
+ }
+
+
+def test_load_json_to_pandas_has_correct_result(session):
+ df = session.read_gbq("SELECT JSON_OBJECT('foo', 10, 'bar', TRUE) AS json_col")
+ assert df.dtypes["json_col"] == db_dtypes.JSONDtype()
result = df.to_pandas()
- expected = pd.DataFrame(
- {
- "json_column": ['{"bar":true,"foo":10}'],
- },
- dtype=pd.ArrowDtype(pa.large_string()),
- )
- expected.index = expected.index.astype("Int64")
- pd.testing.assert_series_equal(result.dtypes, expected.dtypes)
- pd.testing.assert_series_equal(result["json_column"], expected["json_column"])
+
+ # The order of keys within the JSON object shouldn't matter for equality checks.
+ pd_df = pd.DataFrame(
+ {"json_col": [{"bar": True, "foo": 10}]},
+ dtype=db_dtypes.JSONDtype(),
+ )
+ pd_df.index = pd_df.index.astype("Int64")
+ pd.testing.assert_series_equal(result.dtypes, pd_df.dtypes)
+ pd.testing.assert_series_equal(result["json_col"], pd_df["json_col"])
+
+
+def test_load_json_in_struct(session):
+ """Avoid regressions for internal issue 381148539."""
+ sql = """
+ SELECT 0 AS id, STRUCT(JSON_OBJECT('boolean', True) AS data, 1 AS number) AS struct_col
+ UNION ALL
+ SELECT 1, STRUCT(JSON_OBJECT('int', 100), 2),
+ UNION ALL
+ SELECT 2, STRUCT(JSON_OBJECT('float', 0.98), 3),
+ UNION ALL
+ SELECT 3, STRUCT(JSON_OBJECT('string', 'hello world'), 4),
+ UNION ALL
+ SELECT 4, STRUCT(JSON_OBJECT('array', [8, 9, 10]), 5),
+ UNION ALL
+ SELECT 5, STRUCT(JSON_OBJECT('null', null), 6),
+ UNION ALL
+ SELECT
+ 6,
+ STRUCT(JSON_OBJECT(
+ 'dict',
+ JSON_OBJECT(
+ 'int', 1,
+ 'array', [JSON_OBJECT('bar', 'hello'), JSON_OBJECT('foo', 1)]
+ )
+ ), 7),
+ """
+ df = session.read_gbq(sql, index_col="id")
+
+ assert isinstance(df.dtypes["struct_col"], pd.ArrowDtype)
+ assert isinstance(df.dtypes["struct_col"].pyarrow_dtype, pa.StructType)
+
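+ # JSON fields nested inside a STRUCT should keep the JSON dtype.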
+ data = df["struct_col"].struct.field("data")
+ assert data.dtype == db_dtypes.JSONDtype()
+
+ assert data[0]["boolean"]
+ assert data[1]["int"] == 100
+ assert math.isclose(data[2]["float"], 0.98)
+ assert data[3]["string"] == "hello world"
+ assert data[4]["array"] == [8, 9, 10]
+ assert data[5]["null"] is None
+ assert data[6]["dict"] == {
+ "int": 1,
+ "array": [{"bar": "hello"}, {"foo": 1}],
+ }
+
+
+def test_load_json_in_array(session):
+ sql = """
+ SELECT
+ 0 AS id,
+ [
+ JSON_OBJECT('boolean', True),
+ JSON_OBJECT('int', 100),
+ JSON_OBJECT('float', 0.98),
+ JSON_OBJECT('string', 'hello world'),
+ JSON_OBJECT('array', [8, 9, 10]),
+ JSON_OBJECT('null', null),
+ JSON_OBJECT(
+ 'dict',
+ JSON_OBJECT(
+ 'int', 1,
+ 'array', [JSON_OBJECT('bar', 'hello'), JSON_OBJECT('foo', 1)]
+ )
+ )
+ ] AS array_col,
+ """
+ df = session.read_gbq(sql, index_col="id")
+
+ assert isinstance(df.dtypes["array_col"], pd.ArrowDtype)
+ assert isinstance(df.dtypes["array_col"].pyarrow_dtype, pa.ListType)
+
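+ # JSON elements inside an ARRAY should likewise keep the JSON dtype.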
+ data = df["array_col"].list
+ assert data.len()[0] == 7
+ assert data[0].dtype == db_dtypes.JSONDtype()
+
+ assert data[0][0]["boolean"]
+ assert data[1][0]["int"] == 100
+ assert math.isclose(data[2][0]["float"], 0.98)
+ assert data[3][0]["string"] == "hello world"
+ assert data[4][0]["array"] == [8, 9, 10]
+ assert data[5][0]["null"] is None
+ assert data[6][0]["dict"] == {
+ "int": 1,
+ "array": [{"bar": "hello"}, {"foo": 1}],
+ }
def test_to_pandas_batches_w_correct_dtypes(scalars_df_default_index):
diff --git a/tests/system/small/test_pandas.py b/tests/system/small/test_pandas.py
index 30ffaa8a7d..e46d073056 100644
--- a/tests/system/small/test_pandas.py
+++ b/tests/system/small/test_pandas.py
@@ -13,6 +13,7 @@
# limitations under the License.
from datetime import datetime
+import typing
import pandas as pd
import pytest
@@ -726,3 +727,69 @@ def test_to_datetime_timestamp_inputs(arg, utc, output_in_utc):
pd.testing.assert_series_equal(
bf_result, pd_result, check_index_type=False, check_names=False
)
+
+
+@pytest.mark.parametrize(
+ "unit",
+ [
+ "W",
+ "w",
+ "D",
+ "d",
+ "days",
+ "day",
+ "hours",
+ "hour",
+ "hr",
+ "h",
+ "m",
+ "minute",
+ "min",
+ "minutes",
+ "s",
+ "seconds",
+ "sec",
+ "second",
+ "ms",
+ "milliseconds",
+ "millisecond",
+ "milli",
+ "millis",
+ "us",
+ "microseconds",
+ "microsecond",
+ "µs",
+ "micro",
+ "micros",
+ ],
+)
+def test_to_timedelta_with_bf_series(session, unit):
+ bf_series = bpd.Series([1, 2, 3], session=session)
+ pd_series = pd.Series([1, 2, 3])
+
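+ # to_timedelta returns duration[us][pyarrow]; cast to ns to compare against pandas.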
+ actual_result = (
+ typing.cast(bpd.Series, bpd.to_timedelta(bf_series, unit))
+ .to_pandas()
+ .astype("timedelta64[ns]")
+ )
+
+ expected_result = pd.to_timedelta(pd_series, unit)
+ pd.testing.assert_series_equal(
+ actual_result, expected_result, check_index_type=False
+ )
+
+
+@pytest.mark.parametrize(
+ "unit",
+ ["Y", "M", "whatever"],
+)
+def test_to_timedelta_with_bf_series_invalid_unit(session, unit):
+ bf_series = bpd.Series([1, 2, 3], session=session)
+
+ with pytest.raises(TypeError):
+ bpd.to_timedelta(bf_series, unit)
+
+
+@pytest.mark.parametrize("input", [1, 1.2, "1s"])
+def test_to_timedelta_non_bf_series(input):
+ assert bpd.to_timedelta(input) == pd.to_timedelta(input)
diff --git a/tests/system/small/test_remote_function.py b/tests/system/small/test_remote_function.py
index c3f3890459..0dc8960f62 100644
--- a/tests/system/small/test_remote_function.py
+++ b/tests/system/small/test_remote_function.py
@@ -25,8 +25,8 @@
import bigframes
import bigframes.dtypes
import bigframes.exceptions
-from bigframes.functions import _utils as rf_utils
-from bigframes.functions import remote_function as rf
+from bigframes.functions import _utils as bff_utils
+from bigframes.functions import function as bff
from tests.system.utils import assert_pandas_df_equal
_prefixer = test_utils.prefixer.Prefixer("bigframes", "")
@@ -94,12 +94,12 @@ def get_rf_name(func, package_requirements=None, is_row_processor=False):
"""Get a remote function name for testing given a udf."""
# Augment user package requirements with any internal package
# requirements
- package_requirements = rf_utils._get_updated_package_requirements(
+ package_requirements = bff_utils._get_updated_package_requirements(
package_requirements, is_row_processor
)
# Compute a unique hash representing the user code
- function_hash = rf_utils._get_hash(func, package_requirements)
+ function_hash = bff_utils._get_hash(func, package_requirements)
return f"bigframes_{function_hash}"
@@ -117,7 +117,7 @@ def test_remote_function_direct_no_session_param(
def square(x):
return x * x
- square = rf.remote_function(
+ square = bff.remote_function(
int,
int,
bigquery_client=bigquery_client,
@@ -176,7 +176,7 @@ def test_remote_function_direct_no_session_param_location_specified(
def square(x):
return x * x
- square = rf.remote_function(
+ square = bff.remote_function(
int,
int,
bigquery_client=bigquery_client,
@@ -235,7 +235,7 @@ def square(x):
ValueError,
match=re.escape("The location does not match BigQuery connection location:"),
):
- rf.remote_function(
+ bff.remote_function(
int,
int,
bigquery_client=bigquery_client,
@@ -263,7 +263,7 @@ def test_remote_function_direct_no_session_param_location_project_specified(
def square(x):
return x * x
- square = rf.remote_function(
+ square = bff.remote_function(
int,
int,
bigquery_client=bigquery_client,
@@ -324,7 +324,7 @@ def square(x):
"The project_id does not match BigQuery connection gcp_project_id:"
),
):
- rf.remote_function(
+ bff.remote_function(
int,
int,
bigquery_client=bigquery_client,
@@ -346,7 +346,7 @@ def test_remote_function_direct_session_param(
def square(x):
return x * x
- square = rf.remote_function(
+ square = bff.remote_function(
int,
int,
session=session_with_bq_connection,
@@ -636,7 +636,7 @@ def add_one(x):
def test_read_gbq_function_detects_invalid_function(session, dataset_id):
dataset_ref = bigquery.DatasetReference.from_string(dataset_id)
with pytest.raises(ValueError) as e:
- rf.read_gbq_function(
+ bff.read_gbq_function(
str(dataset_ref.routine("not_a_function")),
session=session,
)
@@ -658,7 +658,7 @@ def test_read_gbq_function_like_original(
def square1(x):
return x * x
- square1 = rf.remote_function(
+ square1 = bff.remote_function(
[int],
int,
bigquery_client=bigquery_client,
@@ -674,7 +674,7 @@ def square1(x):
# Function should still work normally.
assert square1(2) == 4
- square2 = rf.read_gbq_function(
+ square2 = bff.read_gbq_function(
function_name=square1.bigframes_remote_function, # type: ignore
session=session,
)
@@ -745,7 +745,7 @@ def test_read_gbq_function_reads_udfs(session, bigquery_client, dataset_id):
for routine in (sql_routine, js_routine):
# Create the routine in BigQuery and read it back using read_gbq_function.
bigquery_client.create_routine(routine, exists_ok=True)
- square = rf.read_gbq_function(
+ square = bff.read_gbq_function(
str(routine.reference),
session=session,
)
@@ -757,7 +757,7 @@ def test_read_gbq_function_reads_udfs(session, bigquery_client, dataset_id):
src = {"x": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]}
- routine_ref_str = rf_utils.routine_ref_to_string_for_query(routine.reference)
+ routine_ref_str = bff_utils.routine_ref_to_string_for_query(routine.reference)
direct_sql = " UNION ALL ".join(
[f"SELECT {x} AS x, {routine_ref_str}({x}) AS y" for x in src["x"]]
)
@@ -818,7 +818,7 @@ def test_read_gbq_function_requires_explicit_types(
bigquery_client.create_routine(only_arg_type_specified, exists_ok=True)
bigquery_client.create_routine(neither_type_specified, exists_ok=True)
- rf.read_gbq_function(
+ bff.read_gbq_function(
str(both_types_specified.reference),
session=session,
)
@@ -826,17 +826,17 @@ def test_read_gbq_function_requires_explicit_types(
bigframes.exceptions.UnknownDataTypeWarning,
match="missing input data types.*assume default data type",
):
- rf.read_gbq_function(
+ bff.read_gbq_function(
str(only_return_type_specified.reference),
session=session,
)
with pytest.raises(ValueError):
- rf.read_gbq_function(
+ bff.read_gbq_function(
str(only_arg_type_specified.reference),
session=session,
)
with pytest.raises(ValueError):
- rf.read_gbq_function(
+ bff.read_gbq_function(
str(neither_type_specified.reference),
session=session,
)
@@ -878,13 +878,13 @@ def test_read_gbq_function_respects_python_output_type(
body="TO_JSON_STRING([x, x+1, x+2])",
arguments=[arg],
return_type=bigquery.StandardSqlDataType(bigquery.StandardSqlTypeNames.STRING),
- description=rf_utils.get_bigframes_metadata(python_output_type=array_type),
+ description=bff_utils.get_bigframes_metadata(python_output_type=array_type),
type_=bigquery.RoutineType.SCALAR_FUNCTION,
)
# Create the routine in BigQuery and read it back using read_gbq_function.
bigquery_client.create_routine(sql_routine, exists_ok=True)
- func = rf.read_gbq_function(str(sql_routine.reference), session=session)
+ func = bff.read_gbq_function(str(sql_routine.reference), session=session)
# test that the function works as expected
s = bigframes.series.Series([1, 10, 100])
@@ -920,7 +920,7 @@ def test_read_gbq_function_supports_python_output_type_only_for_string_outputs(
body="x+1",
arguments=[arg],
return_type=bigquery.StandardSqlDataType(bigquery.StandardSqlTypeNames.INT64),
- description=rf_utils.get_bigframes_metadata(python_output_type=array_type),
+ description=bff_utils.get_bigframes_metadata(python_output_type=array_type),
type_=bigquery.RoutineType.SCALAR_FUNCTION,
)
@@ -933,7 +933,7 @@ def test_read_gbq_function_supports_python_output_type_only_for_string_outputs(
TypeError,
match="An explicit output_type should be provided only for a BigQuery function with STRING output.",
):
- rf.read_gbq_function(str(sql_routine.reference), session=session)
+ bff.read_gbq_function(str(sql_routine.reference), session=session)
@pytest.mark.parametrize(
@@ -959,13 +959,13 @@ def test_read_gbq_function_supported_python_output_type(
body="CAST(x AS STRING)",
arguments=[arg],
return_type=bigquery.StandardSqlDataType(bigquery.StandardSqlTypeNames.STRING),
- description=rf_utils.get_bigframes_metadata(python_output_type=array_type),
+ description=bff_utils.get_bigframes_metadata(python_output_type=array_type),
type_=bigquery.RoutineType.SCALAR_FUNCTION,
)
# Create the routine in BigQuery and read it back using read_gbq_function.
bigquery_client.create_routine(sql_routine, exists_ok=True)
- rf.read_gbq_function(str(sql_routine.reference), session=session)
+ bff.read_gbq_function(str(sql_routine.reference), session=session)
@pytest.mark.flaky(retries=2, delay=120)
diff --git a/tests/system/small/test_series.py b/tests/system/small/test_series.py
index 670828f616..3d76122e9d 100644
--- a/tests/system/small/test_series.py
+++ b/tests/system/small/test_series.py
@@ -17,6 +17,7 @@
import re
import tempfile
+import db_dtypes # type: ignore
import geopandas as gpd # type: ignore
import numpy
from packaging.version import Version
@@ -281,7 +282,7 @@ def test_get_column(scalars_dfs, col_name, expected_dtype):
def test_get_column_w_json(json_df, json_pandas_df):
series = json_df["json_col"]
series_pandas = series.to_pandas()
- assert series.dtype == pd.ArrowDtype(pa.large_string())
+ assert series.dtype == db_dtypes.JSONDtype()
assert series_pandas.shape[0] == json_pandas_df.shape[0]
diff --git a/tests/system/utils.py b/tests/system/utils.py
index 83d0e683bc..7c12c8033a 100644
--- a/tests/system/utils.py
+++ b/tests/system/utils.py
@@ -26,7 +26,7 @@
import pyarrow as pa # type: ignore
import pytest
-import bigframes.functions._utils as functions_utils
+import bigframes.functions._utils as bff_utils
import bigframes.pandas
ML_REGRESSION_METRICS = [
@@ -351,7 +351,7 @@ def get_cloud_functions(
not name or not name_prefix
), "Either 'name' or 'name_prefix' can be passed but not both."
- _, location = functions_utils.get_remote_function_locations(location)
+ _, location = bff_utils.get_remote_function_locations(location)
parent = f"projects/{project}/locations/{location}"
request = functions_v2.ListFunctionsRequest(parent=parent)
page_result = functions_client.list_functions(request=request)
diff --git a/tests/unit/functions/test_remote_function_template.py b/tests/unit/functions/test_function_template.py
similarity index 92%
rename from tests/unit/functions/test_remote_function_template.py
rename to tests/unit/functions/test_function_template.py
index 70b033d938..11db01ed9e 100644
--- a/tests/unit/functions/test_remote_function_template.py
+++ b/tests/unit/functions/test_function_template.py
@@ -20,7 +20,7 @@
import pytest
import bigframes.dtypes
-import bigframes.functions.remote_function_template as remote_function_template
+import bigframes.functions.function_template as bff_template
HELLO_WORLD_BASE64_BYTES = b"SGVsbG8sIFdvcmxkIQ=="
HELLO_WORLD_BASE64_STR = "SGVsbG8sIFdvcmxkIQ=="
@@ -59,7 +59,7 @@
),
)
def test_convert_from_bq_json(type_, json_value, expected):
- got = remote_function_template.convert_from_bq_json(type_, json_value)
+ got = bff_template.convert_from_bq_json(type_, json_value)
assert got == expected
@@ -76,7 +76,7 @@ def test_convert_from_bq_json(type_, json_value, expected):
],
)
def test_convert_from_bq_json_none(type_):
- got = remote_function_template.convert_from_bq_json(type_, None)
+ got = bff_template.convert_from_bq_json(type_, None)
assert got is None
@@ -113,7 +113,7 @@ def test_convert_from_bq_json_none(type_):
),
)
def test_convert_to_bq_json(type_, value, expected):
- got = remote_function_template.convert_to_bq_json(type_, value)
+ got = bff_template.convert_to_bq_json(type_, value)
assert got == expected
@@ -130,7 +130,7 @@ def test_convert_to_bq_json(type_, value, expected):
],
)
def test_convert_to_bq_json_none(type_):
- got = remote_function_template.convert_to_bq_json(type_, None)
+ got = bff_template.convert_to_bq_json(type_, None)
assert got is None
@@ -176,7 +176,7 @@ def test_convert_to_bq_json_none(type_):
),
)
def test_get_pd_series(row_json, expected):
- got = remote_function_template.get_pd_series(row_json)
+ got = bff_template.get_pd_series(row_json)
pandas.testing.assert_series_equal(got, expected)
diff --git a/tests/unit/polars_session.py b/tests/unit/polars_session.py
index dfb1f5bfa6..cffd8ff7ca 100644
--- a/tests/unit/polars_session.py
+++ b/tests/unit/polars_session.py
@@ -82,7 +82,7 @@ def __init__(self):
self._allow_ambiguity = False # type: ignore
self._default_index_type = bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64
self._metrics = bigframes.session.metrics.ExecutionMetrics()
- self._remote_function_session = None # type: ignore
+ self._function_session = None # type: ignore
self._temp_storage_manager = None # type: ignore
self._executor = TestExecutor()
self._loader = None # type: ignore
diff --git a/tests/unit/test_remote_function.py b/tests/unit/test_remote_function.py
index a8c4f2ac2e..413a694680 100644
--- a/tests/unit/test_remote_function.py
+++ b/tests/unit/test_remote_function.py
@@ -21,7 +21,7 @@
import bigframes.core.compile.ibis_types
import bigframes.dtypes
-import bigframes.functions.remote_function
+import bigframes.functions.function as bff
import bigframes.series
from tests.unit import resources
@@ -42,9 +42,7 @@
def test_series_input_types_to_str(series_type):
"""Check that is_row_processor=True uses str as the input type to serialize a row."""
session = resources.create_bigquery_session()
- remote_function_decorator = bigframes.functions.remote_function.remote_function(
- session=session
- )
+ remote_function_decorator = bff.remote_function(session=session)
with pytest.warns(
bigframes.exceptions.PreviewWarning,
@@ -75,9 +73,7 @@ def test_supported_types_correspond():
def test_missing_input_types():
session = resources.create_bigquery_session()
- remote_function_decorator = bigframes.functions.remote_function.remote_function(
- session=session
- )
+ remote_function_decorator = bff.remote_function(session=session)
def function_without_parameter_annotations(myparam) -> str:
return str(myparam)
@@ -93,9 +89,7 @@ def function_without_parameter_annotations(myparam) -> str:
def test_missing_output_type():
session = resources.create_bigquery_session()
- remote_function_decorator = bigframes.functions.remote_function.remote_function(
- session=session
- )
+ remote_function_decorator = bff.remote_function(session=session)
def function_without_return_annotation(myparam: int):
return str(myparam)
diff --git a/third_party/bigframes_vendored/pandas/core/frame.py b/third_party/bigframes_vendored/pandas/core/frame.py
index c8ca1b74b5..f5aa23d00b 100644
--- a/third_party/bigframes_vendored/pandas/core/frame.py
+++ b/third_party/bigframes_vendored/pandas/core/frame.py
@@ -2048,6 +2048,98 @@ def where(self, cond, other):
"""
raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
+ def mask(self, cond, other):
+ """Replace values where the condition is False.
+
+ **Examples:**
+
+ >>> import bigframes.pandas as bpd
+ >>> bpd.options.display.progress_bar = None
+
+ >>> df = bpd.DataFrame({'a': [20, 10, 0], 'b': [0, 10, 20]})
+ >>> df
+ a b
+ 0 20 0
+ 1 10 10
+ 2 0 20
+
+ [3 rows x 2 columns]
+
+ You can replace values in the dataframe based on a condition. Values
+ where the condition is True are replaced; the rest are kept. The
+ default replacement value is ``NA``. For example, when the condition
+ is a dataframe:
+
+ >>> df.mask(df > 0)
+ a b
+ 0 <NA> 0
+ 1 <NA> <NA>
+ 2 0 <NA>
+
+ [3 rows x 2 columns]
+
+ You can specify a custom replacement value for non-matching values.
+
+ >>> df.mask(df > 0, -1)
+ a b
+ 0 -1 0
+ 1 -1 -1
+ 2 0 -1
+
+ [3 rows x 2 columns]
+
+ Besides a dataframe, the condition can also be a series. For example:
+
+ >>> df.mask(df['a'] > 10, -1)
+ a b
+ 0 -1 -1
+ 1 10 10
+ 2 0 20
+
+ [3 rows x 2 columns]
+
+ The replacement can be a dataframe too. For example:
+
+ >>> df.mask(df > 10, -df)
+ a b
+ 0 -20 0
+ 1 10 10
+ 2 0 -20
+
+ [3 rows x 2 columns]
+
+ >>> df.mask(df['a'] > 10, -df)
+ a b
+ 0 -20 0
+ 1 10 10
+ 2 0 20
+
+ [3 rows x 2 columns]
+
+ Note that a Series is not yet supported as the replacement value. In
+ pandas, specifying a Series as the replacement requires also specifying
+ the axis, which bigframes DataFrame does not support.
+
+ Args:
+ cond (bool Series/DataFrame, array-like, or callable):
+ Where cond is False, keep the original value. Where True, replace
+ with corresponding value from other. If cond is callable, it is
+ computed on the Series/DataFrame and returns boolean
+ Series/DataFrame or array. The callable must not change input
+ Series/DataFrame (though pandas doesn’t check it).
+ other (scalar, DataFrame, or callable):
+ Entries where cond is True are replaced with corresponding value
+ from other. If other is callable, it is computed on the
+ DataFrame and returns scalar or DataFrame. The callable must not
+ change input DataFrame (though pandas doesn’t check it). If not
+ specified, entries will be filled with the corresponding NULL
+ value (np.nan for numpy dtypes, pd.NA for extension dtypes).
+
+ Returns:
+ DataFrame: DataFrame after the replacement.
+ """
+ raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
+
# ----------------------------------------------------------------------
# Sorting
@@ -4054,6 +4146,47 @@ def cov(self, *, numeric_only) -> DataFrame:
"""
raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
+ def corrwith(
+ self,
+ other,
+ *,
+ numeric_only: bool = False,
+ ):
+ """
+ Compute pairwise correlation.
+
+ Pairwise correlation is computed between rows or columns of
+ DataFrame with rows or columns of Series or DataFrame. DataFrames
+ are first aligned along both axes before computing the
+ correlations.
+
+ **Examples:**
+
+ >>> import bigframes.pandas as bpd
+ >>> import numpy as np
+ >>> bpd.options.display.progress_bar = None
+
+ >>> index = ["a", "b", "c", "d", "e"]
+ >>> columns = ["one", "two", "three", "four"]
+ >>> df1 = bpd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=columns)
+ >>> df2 = bpd.DataFrame(np.arange(16).reshape(4, 4), index=index[:4], columns=columns)
+ >>> df1.corrwith(df2)
+ one 1.0
+ two 1.0
+ three 1.0
+ four 1.0
+ dtype: Float64
+
+ Args:
+ other (DataFrame, Series):
+ Object with which to compute correlations.
+
+ numeric_only (bool, default False):
+ Include only `float`, `int` or `boolean` data.
+
+ Returns:
+ bigframes.pandas.Series: Pairwise correlations.
+ """
+ raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
+
def update(
self, other, join: str = "left", overwrite: bool = True, filter_func=None
) -> DataFrame:
diff --git a/third_party/bigframes_vendored/pandas/core/tools/timedeltas.py b/third_party/bigframes_vendored/pandas/core/tools/timedeltas.py
new file mode 100644
index 0000000000..9442e965fa
--- /dev/null
+++ b/third_party/bigframes_vendored/pandas/core/tools/timedeltas.py
@@ -0,0 +1,99 @@
+# Contains code from https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/tools/timedeltas.py
+
+import typing
+
+from bigframes_vendored import constants
+import pandas as pd
+
+from bigframes import series
+
+UnitChoices = typing.Literal[
+ "W",
+ "w",
+ "D",
+ "d",
+ "days",
+ "day",
+ "hours",
+ "hour",
+ "hr",
+ "h",
+ "m",
+ "minute",
+ "min",
+ "minutes",
+ "s",
+ "seconds",
+ "sec",
+ "second",
+ "ms",
+ "milliseconds",
+ "millisecond",
+ "milli",
+ "millis",
+ "us",
+ "microseconds",
+ "microsecond",
+ "µs",
+ "micro",
+ "micros",
+]
+
+
+def to_timedelta(
+ arg: typing.Union[series.Series, str, int, float],
+ unit: typing.Optional[UnitChoices] = None,
+) -> typing.Union[series.Series, pd.Timedelta]:
+ """
+ Converts a scalar or Series to a timedelta object.
+
+ .. note::
+ BigQuery only supports precision up to microseconds (us). Therefore, when working
+ with timedeltas that have a finer granularity than microseconds, be aware that
+ the additional precision will not be represented in BigQuery.
+
+ **Examples:**
+
+ >>> import bigframes.pandas as bpd
+ >>> bpd.options.display.progress_bar = None
+
+ Converting a Scalar to timedelta
+
+ >>> scalar = 2
+ >>> bpd.to_timedelta(scalar, unit='s')
+ Timedelta('0 days 00:00:02')
+
+ Converting a Series of integers to a Series of timedeltas
+
+ >>> int_series = bpd.Series([1,2,3])
+ >>> bpd.to_timedelta(int_series, unit='s')
+ 0 0 days 00:00:01
+ 1 0 days 00:00:02
+ 2 0 days 00:00:03
+ dtype: duration[us][pyarrow]
+
+ Args:
+ arg (int, float, str, Series):
+ The object to convert to a timedelta.
+ unit (str, default 'us'):
+ Denotes the unit of the arg for numeric `arg`. Defaults to ``"us"``.
+
+ Possible values:
+
+ * 'W' / 'w'
+ * 'D' / 'd' / 'days' / 'day'
+ * 'hours' / 'hour' / 'hr' / 'h'
+ * 'm' / 'minute' / 'min' / 'minutes'
+ * 's' / 'seconds' / 'sec' / 'second'
+ * 'ms' / 'milliseconds' / 'millisecond' / 'milli' / 'millis'
+ * 'us' / 'microseconds' / 'microsecond' / 'µs' / 'micro' / 'micros'
+
+ Returns:
+ Union[pandas.Timedelta, bigframes.pandas.Series]:
+ Return type depends on input:
+ - Series: Series of duration[us][pyarrow] dtype
+ - scalar: timedelta
+
+ """
+
+ raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
diff --git a/third_party/bigframes_vendored/version.py b/third_party/bigframes_vendored/version.py
index 50dde36b01..1fef294cef 100644
--- a/third_party/bigframes_vendored/version.py
+++ b/third_party/bigframes_vendored/version.py
@@ -12,4 +12,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-__version__ = "1.33.0"
+__version__ = "1.34.0"