
Add dbt related packages to conda-analytics
Open, Medium, Public

Description

Whilst engineers are able to use conda and Python environments to install their own copies of dbt, it would be better to have something available system-wide and uniformly across all of the stat servers.

For this reason, we should explore whether it is convenient to add it to conda-analytics.

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| Add dbt-core and relevant connectors to conda-analytics | repos/data-engineering/conda-analytics!59 | btullis | add_dbt | main |

Event Timeline

I have this patch to conda-analytics for review:
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/59

Crucially, this adds the following dbt components to the conda-analytics base environment:

  • dbt-core==1.10.13
  • dbt-spark==1.9.3
  • dbt-trino==1.9.3
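
For reference, pinning these packages in a conda environment spec with pip-installed dependencies might look roughly like the following (a sketch only; the actual layout of the spec in the conda-analytics repo may differ):

```yaml
# Hypothetical excerpt of a conda environment spec pinning the dbt packages;
# the real conda-analytics spec may structure this differently.
dependencies:
  - python=3.10
  - pip
  - pip:
      - dbt-core==1.10.13
      - dbt-spark==1.9.3
      - dbt-trino==1.9.3
```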

If you're happy with it in principle, then the next steps are:

  • [Optional] merge and release at this point.
  • Build a conda-analytics-0.0.39.deb file and put it on apt.wikimedia.org
  • Deploy this deb file to the hadoop test cluster
  • Run some tests, including:
    • Airflow jobs on the analytics-test instance running spark
    • JupyterHub on an-test-client1002
    • conda-analytics-clone on an-test-client1002
  • Merge and deploy (unless already done earlier)
  • Push out to the production hadoop cluster.

I have built a version 0.0.39 of conda-analytics and added it to apt.wikimedia.org
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/jobs/653391

btullis@apt1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
--2025-10-21 15:35:55--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1094619624 (1.0G) [binary/octet-stream]
Saving to: ‘conda-analytics-0.0.39_amd64.deb’

conda-analytics-0.0 100%[===================>]   1.02G   107MB/s    in 9.7s    

2025-10-21 15:36:04 (108 MB/s) - ‘conda-analytics-0.0.39_amd64.deb’ saved [1094619624/1094619624]

btullis@apt1002:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia `pwd`/conda-analytics-0.0.39_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C main includedeb bookworm-wikimedia `pwd`/conda-analytics-0.0.39_amd64.deb
Exporting indices...
Deleting files no longer referenced...
btullis@apt1002:~$

btullis@apt1002:~$ sudo -i reprepro ls conda-analytics
conda-analytics | 0.0.35 |   buster-wikimedia | amd64
conda-analytics | 0.0.39 | bullseye-wikimedia | amd64
conda-analytics | 0.0.39 | bookworm-wikimedia | amd64
btullis@apt1002:~$

I will deploy this to the hadoop-test cluster tomorrow, unless anyone requests that I do not.

@BTullis wouldn't this approach introduce a discrepancy between what users use on stat boxes and what is run in GitLab CI/CD and eventually in Airflow? The latter two will run in Docker images, and I wonder how different the two installations will end up being.

> @BTullis wouldn't this approach introduce a discrepancy between what users use on stat boxes and what is run in GitLab CI/CD and eventually in Airflow? The latter two will run in Docker images, and I wonder how different the two installations will end up being.

Yes, that's true. But conda-analytics isn't necessarily a long-term solution. I'd be much happier to start out with a container based solution as per: T406636: Create a dbt Docker container but container runtimes are not available to us on the stat servers at the moment.

At least this way, we will already have something uniform to work with on the stat servers.

I pushed out the version 0.0.39 package to the test-cluster.

btullis@cumin1003:~$ generate-debdeploy-spec 
<snip>

btullis@cumin1003:~$ cat 2025-10-22-conda-analytics.yaml 
comment: T406766
fixes:
  bookworm: 0.0.39
  bullseye: 0.0.39
  buster: ''
  trixie: ''
libraries: []
source: conda-analytics
transitions: {}
update_type: tool

btullis@cumin1003:~$ sudo debdeploy deploy -u 2025-10-22-conda-analytics.yaml -s hadoop-test
Rolling out conda-analytics:
Non-daemon update, no service restart needed

conda-analytics was updated: 0.0.38 -> 0.0.39
  an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-master[1001-1002].eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet (7 hosts)

The package to be updated isn't installed on these hosts:
  an-test-ui1001.eqiad.wmnet (1 hosts)

Now I can activate this environment on an-test-client1002 and check the versions.

btullis@an-test-client1002:~$ source conda-analytics-activate base

(base) btullis@an-test-client1002:~$ dbt --version
INFO:trino.auth:keyring module not found. OAuth2 token will not be stored in keyring.
INFO:trino.auth:keyring module not found. OAuth2 token will not be stored in keyring.
Core:
  - installed: 1.10.13

  The latest version of dbt-core could not be determined!
  Make sure that the following URL is accessible:
  https://pypi.org/pypi/dbt-core/json

Plugins:
  - spark: 1.9.3 - Could not determine latest version
  - trino: 1.9.3 - Could not determine latest version

dbt tries to reach https://pypi.org/pypi/dbt-core/json to check for newer versions, so we can enable the HTTP proxy to allow that.
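
As a side note, the version check that dbt performs against the PyPI JSON API can be sketched in Python. This uses a hypothetical sample payload rather than a live HTTPS request (dbt itself fetches https://pypi.org/pypi/dbt-core/json, which is why the proxy is needed on these hosts), and the function name is illustrative, not dbt's actual implementation:

```python
import json

# Hypothetical sample of the PyPI JSON API response for dbt-core; the live
# payload at https://pypi.org/pypi/dbt-core/json contains many more fields.
sample_payload = json.dumps({"info": {"version": "1.10.13"}})

def latest_dbt_core_version(payload: str) -> str:
    """Extract the latest released version from a PyPI JSON API payload."""
    return json.loads(payload)["info"]["version"]

installed = "1.10.13"
latest = latest_dbt_core_version(sample_payload)
print(f"installed: {installed}, latest: {latest}, up to date: {installed == latest}")
```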

(base) btullis@an-test-client1002:~$ set_proxy
Proxy set
(base) btullis@an-test-client1002:~$ dbt --version
INFO:trino.auth:keyring module not found. OAuth2 token will not be stored in keyring.
INFO:trino.auth:keyring module not found. OAuth2 token will not be stored in keyring.
Core:
  - installed: 1.10.13
  - latest:    1.10.13 - Up to date!

Plugins:
  - spark: 1.9.3 - Up to date!
  - trino: 1.9.3 - Up to date!

So this is all ready for testing on an-test-client1002.

I'll address the comments on https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/59 and, if needed, build an updated version.

We got this working with spark in session mode, using the dbt-core and dbt-spark packages in conda-analytics version 0.0.39.

SSH to the host and activate the base environment:

btullis@barracuda:~$ ssh an-test-client1002.eqiad.wmnet
btullis@an-test-client1002:~$ source conda-analytics-activate base

Clone the dbt repo and checkout our feature branch:

(base) btullis@an-test-client1002:~$ git clone https://gitlab.wikimedia.org/repos/data-engineering/dbt.git
Cloning into 'dbt'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 71 (delta 34), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (71/71), 120.70 KiB | 3.45 MiB/s, done.
Resolving deltas: 100% (34/34), done.
(base) btullis@an-test-client1002:~$ cd dbt/
(base) btullis@an-test-client1002:~/dbt$ git checkout feature/dbt-setup
Branch 'feature/dbt-setup' set up to track remote branch 'feature/dbt-setup' from 'origin'.
Switched to a new branch 'feature/dbt-setup'

We created a profile as per: https://docs.getdbt.com/docs/core/connect-data-platform/spark-setup#session
This could have gone in ~/.dbt/profiles.yml, but I chose to put mine in the root of the repo.

(base) btullis@an-test-client1002:~/dbt$ cat profiles.yml 
datalake:
  target: dev
  outputs:
    dev:
      type: spark
      method: session
      schema: btullis
      host: NA                           # not used, but required by `dbt-core`
      server_side_parameters:
        "spark.driver.memory": "4g"

Check with dbt debug that everything looks OK.

(base) btullis@an-test-client1002:~/dbt$ dbt debug
16:12:46  Running with dbt=1.10.13
16:12:46  dbt version: 1.10.13
16:12:46  python version: 3.10.19
16:12:46  python path: /home/btullis/.conda/envs/dbt/bin/python
16:12:46  os info: Linux-5.10.0-30-amd64-x86_64-with-glibc2.31
16:12:47  Using profiles dir at /home/btullis/dbt
16:12:47  Using profiles.yml file at /home/btullis/dbt/profiles.yml
16:12:47  Using dbt_project.yml file at /home/btullis/dbt/dbt_project.yml
16:12:47  adapter type: spark
16:12:47  adapter version: 1.9.3
16:12:47  Configuration:
16:12:47    profiles.yml file [OK found and valid]
16:12:47    dbt_project.yml file [OK found and valid]
16:12:47  Required dependencies:
16:12:47   - git [OK found]

16:12:47  Connection:
16:12:47    host: NA
16:12:47    port: 443
16:12:47    cluster: None
16:12:47    endpoint: None
16:12:47    schema: btullis
16:12:47    organization: 0
16:12:47  Registered adapter: spark=1.9.3
SPARK_HOME: /home/btullis/.conda/envs/dbt/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/home/btullis/.conda/envs/dbt/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/22 16:12:51 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
25/10/22 16:12:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/10/22 16:12:59 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
16:13:02    Connection test: [OK connection ok]

16:13:02  All checks passed!

Install dependencies.

(dbt) btullis@an-test-client1002:~/dbt$ dbt deps
16:13:47  Running with dbt=1.10.13
16:13:48  Installing dbt-labs/dbt_utils
16:13:48  Installed from version 1.3.1
16:13:48  Up to date!
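
The dbt_utils dependency resolved above is declared in the repo's packages.yml. A minimal version, assuming the standard dbt package syntax (the exact version constraint in the actual repo may differ), looks like:

```yaml
# Minimal packages.yml declaring the dbt_utils dependency (sketch).
packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.1
```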

Run the model with dbt run

(dbt) btullis@an-test-client1002:~/dbt$ dbt run
16:13:54  Running with dbt=1.10.13
16:13:55  Registered adapter: spark=1.9.3
16:13:55  Unable to do partial parsing because saved manifest not found. Starting full parse.
16:13:57  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 3 unused configuration paths:
- models.datalake.intermediate
- models.datalake.staging
- models.datalake.marts
16:13:57  Found 2 models, 4 data tests, 590 macros
16:13:57  
16:13:57  Concurrency: 1 threads (target='dev')
16:13:57  
SPARK_HOME: /home/btullis/.conda/envs/dbt/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/home/btullis/.conda/envs/dbt/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/22 16:13:59 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
25/10/22 16:14:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/10/22 16:14:07 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
16:14:13  1 of 2 START sql table model btullis.my_first_dbt_model ........................ [RUN]
25/10/22 16:14:14 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
25/10/22 16:14:14 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:14:23  1 of 2 OK created sql table model btullis.my_first_dbt_model ................... [OK in 9.68s]
16:14:23  2 of 2 START sql view model btullis.my_second_dbt_model ........................ [RUN]
16:14:23  2 of 2 OK created sql view model btullis.my_second_dbt_model ................... [OK in 0.35s]
16:14:23  
16:14:23  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 26.24 seconds (26.24s).
16:14:23  
16:14:23  Completed successfully
16:14:23  
16:14:23  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=2

Since I had rebuilt version 0.0.39 of conda-analytics, I updated the version on the apt servers.

btullis@apt1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
--2025-10-22 17:20:30--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.39/conda-analytics-0.0.39_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:861:2:208:80:154:145, 208.80.154.145
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:861:2:208:80:154:145|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1094453352 (1.0G) [binary/octet-stream]
Saving to: ‘conda-analytics-0.0.39_amd64.deb’

conda-analytics-0.0.39_amd64.deb                     100%[=====================================================================================================================>]   1.02G   109MB/s    in 9.6s    

2025-10-22 17:20:40 (108 MB/s) - ‘conda-analytics-0.0.39_amd64.deb’ saved [1094453352/1094453352]

btullis@apt1002:~$ sudo -i reprepro -C main remove bookworm-wikimedia conda-analytics
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C main remove bullseye-wikimedia conda-analytics
Exporting indices...
Deleting files no longer referenced...
btullis@apt1002:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia `pwd`/conda-analytics-0.0.39_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C main includedeb bookworm-wikimedia `pwd`/conda-analytics-0.0.39_amd64.deb
Exporting indices...
btullis@apt1002:~$

I'll make one final check of this package on the test cluster tomorrow, then push it out to production.

> Yes, that's true. But conda-analytics isn't necessarily a long-term solution. I'd be much happier to start out with a container based solution as per: T406636: Create a dbt Docker container but container runtimes are not available to us on the stat servers at the moment.
>
> At least this way, we will already have something uniform to work with on the stat servers.

Sounds good to me!

I'm deploying version 0.0.39 to production now with:

btullis@cumin1003:~$ sudo debdeploy deploy -u 2025-10-22-conda-analytics.yaml -Q 'R:Package = conda-analytics'
Rolling out conda-analytics:
Non-daemon update, no service restart needed