I am trying to determine which variables affect the likelihood of lumpfish eating salmon lice and which variables predict the number of lice eaten. My data are highly zero-inflated, so I decided to use a hurdle model (hurdle()) from the pscl package. However, since I have several predictors that do not seem to have a linear relationship with either the likelihood or the number of lice eaten, I incorporated splines (ns()) into the model. Here is the model with the lowest AIC:
Call:
hurdle(formula = Laksalus_tal ~ ns(Total_laksalus_rolling, df = 3) + ns(Vekt, df = 5) + Verur_a_notini | ns(Total_laksalus_rolling, df = 3) + ns(Vekt, df = 5) +
Hvalspýggj + Lice_per_ring_baseline + Djóraæti_uttanjelly + Verur_a_notini, data = data, dist = "negbin", zero.dist = "binomial")
Pearson residuals:
Min 1Q Median 3Q Max
-0.26391 -0.11924 -0.08661 -0.06190 81.26243
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.9943 87.5056 -0.148 0.881950
ns(Total_laksalus_rolling, df = 3)1 1.1977 0.6326 1.893 0.058335 .
ns(Total_laksalus_rolling, df = 3)2 4.9079 0.9588 5.119 3.08e-07 ***
ns(Total_laksalus_rolling, df = 3)3 2.7670 1.7266 1.603 0.109025
ns(Vekt, df = 5)1 1.5328 0.8386 1.828 0.067579 .
ns(Vekt, df = 5)2 0.3391 1.0679 0.318 0.750796
ns(Vekt, df = 5)3 7.0208 2.2309 3.147 0.001649 **
ns(Vekt, df = 5)4 -32.1159 9.4733 -3.390 0.000699 ***
ns(Vekt, df = 5)5 -68.8257 19.4001 -3.548 0.000389 ***
Verur_a_notini1 -1.2374 0.1734 -7.136 9.63e-13 ***
Log(theta) -14.3543 87.5002 -0.164 0.869692
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.482984 0.444500 -12.335 < 2e-16 ***
ns(Total_laksalus_rolling, df = 3)1 0.187322 0.267284 0.701 0.48341
ns(Total_laksalus_rolling, df = 3)2 2.986707 0.354080 8.435 < 2e-16 ***
ns(Total_laksalus_rolling, df = 3)3 3.154348 0.524225 6.017 1.77e-09 ***
ns(Vekt, df = 5)1 1.836310 0.391115 4.695 2.67e-06 ***
ns(Vekt, df = 5)2 1.122467 0.473999 2.368 0.01788 *
ns(Vekt, df = 5)3 -1.296933 0.752062 -1.725 0.08462 .
ns(Vekt, df = 5)4 2.086199 1.967267 1.060 0.28894
ns(Vekt, df = 5)5 -1.192533 3.779546 -0.316 0.75236
Hvalspýggj1 -0.955653 0.091983 -10.389 < 2e-16 ***
Lice_per_ring_baseline 0.027207 0.009029 3.013 0.00258 **
Djóraæti_uttanjelly1 0.285546 0.113077 2.525 0.01156 *
Verur_a_notini1 -0.308816 0.078037 -3.957 7.58e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0
Number of iterations in BFGS optimization: 58
Log-likelihood: -6399 on 24 Df
Total_laksalus_rolling, Lice_per_ring_baseline and Vekt are numerical predictors, while the other variables are binary (0/1).
My questions:
Model evaluation: How should I evaluate the overall fit and appropriateness of this hurdle model? Are there specific diagnostics or metrics recommended for assessing hurdle models?
Interpretation of splines: How do I interpret the spline terms (ns())? Specifically, how can I understand the relationship between predictors (Total_laksalus_rolling and Vekt) and both the likelihood of lice consumption (zero hurdle) and the number of lice eaten (count component)?
Below are residual diagnostics plots:


hurdle()function you used. Second, don't forget that the default locations of boundary knots inns()are at the extreme values. You might try boundary knots within the range (e.g., 5th and 95th percentiles). Third, look at this answer for guidance on hurdle versus zero-inflated models; might zero inflated be better here?. Fourth, are you worried by the extreme (and poorly estimated) value ofthetain thenegbinmodel? $\endgroup$