Quantifying microbial interactions based on compositional data using an iterative approach for solving generalized Lotka-Volterra equations

0
1

Introduction
Quantifying microbial interactions is essential for understanding the dynamics of microbial communities. This task has gained increased attention in microbiome research, as microbial interactions directly impact community stability and host health [1–3]. The Lotka-Volterra (LV) modeling framework was originally developed to describe predator-prey systems [4], based on which the generalized Lotka-Volterra (gLV) was proposed to model systems of more than two species. gLV has been widely adopted to infer these interactions due to its versatility in modeling nonlinear dynamics across diverse ecosystems, from microbial communities to macroscopic populations [5–7]. The generalized Lotka-Volterra (gLV) model remains a key tool for describing microbial interactions and forecasting community trajectories. Applications range from predicting microbiome responses to antibiotics and dietary shifts to understanding microbial coexistence in ecological contexts [8,9].
However, the widespread use of sequencing technologies in microbial studies introduces a unique challenge: most data are compositional, providing relative rather than absolute abundance profiles [10,11]. This poses significant limitations for traditional gLV models, which rely on absolute abundance inputs for accurate parameterization and prediction [12,13]. To address this problem, compositional adaptations of gLV models have been developed, such as the compositional Lotka-Volterra (cLV) framework, which maps dynamics onto a constrained simplex, accommodating the summation constraint of relative abundances [13].
These previous advancements have highlighted the potential of compositional data analysis, but they also have some major limitations, such as moderate accuracy due to their nature of linear approximations. For cLV, the coefficients in the gLV model cannot be fully recovered [13], making it harder to interpret the results. Gaps remain in integrating theoretical and empirical approaches for compositional data. To bridge the gap, we proposed a novel method (called iterative Lotka-Volterra or iLV) of compositional data analysis based on the traditional gLV model, emphasizing the compositional constraints and their implications for model parameterization. There are two major innovative components of iLV, including defining the classical generalized Lotka-Volterra model with relative abundances and inter-species sum of absolute abundances, and iterative linear approximations followed by non-linear optimizations. By leveraging computational simulations, we compared the performance of iLV with other existing methods including generalized Lotka–Volterra (gLV) and compositional Lotka-Volterra (cLV). We also applied iLV in three real-world datasets to estimate the coefficients of gLV model and accurately recover community relative abundance trajectories.
Results
Performance and stability of different non-linear optimization methods
When applying the iLV algorithm to parameter estimation, the choice of non-linear optimization methods in Subroutine 2 can influence both accuracy and numerical stability, particularly when the problem is ill-conditioned. Fig 1 presents an example where instability occurred for different non-linear optimization methods. We simulated data points using the following parameter setting that reflects periodic oscillations among three species: , , , , , , , , , , , , , with a time length of 20 and a time interval of 0.4. We executed the iLV Algorithm 20 times using the simulated data and estimated the parameters with three different non-linear optimization methods: leastsq(), least_squares(method = ‘lm’), and least_squares(method=’trf’). was set to 200. As shown in Fig 1, all three methods exhibited instabilities potentially due to rounding errors and the ill-conditioned nature of this dataset, and leastsq() achieved the lowest trajectory RMSE of . Therefore, to mitigate the effects of numerical instabilities and the variable performance of optimization methods across different datasets, we compared the trajectory RMSE returned by least_squares (method = “trf” or method = “lm”) and leastsq() in Subroutine 2, and retained the results with the lowest RMSE. Additionally, during benchmarking, we repeated the algorithm 20 times and reported the parameter that yielded the lowest RMSE among these 20 runs.
Data points were simulated using the following parameter setting that reflects periodic oscillations among three species: , , , , , , , , , , , , , with a time length of 20 and a time interval of 0.4. All three methods exhibited instabilities for this dataset, while leastsq() achieved the lowest trajectory RMSE of . The mean RMSEs across 20 runs for leastsq(), least_squares(method = ’lm’), and least_squares(method = ’trf’) were 0.0687, 0.135, and 0.103, respectively, while their corresponding median RMSEs were 0.0386, 0.0958, and 0.0945.
https://doi.org/10.1371/journal.pcbi.1013691.g001
Two subroutines of the iLV Algorithm jointly improve parameter estimation
An accurate initial guess of parameters is crucial for the success of optimization functions like leastsq() or least_squares() in finding an optimal local minimum of the cost function. Subroutine 1 (the iterative subroutine) of the iLV Algorithm provides an iterative approach to generate such initial guesses effectively. To illustrate the impact of the iterative subroutine, we generated simulated data using the following parameter settings: , , , , , , , , , , , , , time range = 20, time interval = 1. was set to 200. Fig 2A shows the trajectory RMSE as a function of the number of iterations in Subroutine 1 (the iterative subroutine). The figure indicates that Subroutine 1 iteratively refined the initial guess, with the optimal guess achieved in the 13th iteration out of 100. Fig 2B and 2C illustrate how Subroutine 1 substantially improved the fit between predicted and observed relative abundances. Without applying Subroutine 1, using the parameter estimations derived from the gLV linear approximation as the starting point of Subroutine 2 (the least square estimation subroutine) resulted in poor optimization performance, with an RMSE of 0.122 for leastsq() (Fig 2C). In contrast, using the parameters from the iteration with the lowest trajectory RMSE within the first 100 iterations significantly improved optimization performance, yielding an RMSE of with leastsq() (Fig 2B). Under this parameter setting, least_squares(method = ‘lm’ or ‘trf’) did not perform as well as leastsq(). Notably, the iterative subroutine guarantees non-increasing trajectory RMSE values, as the first iteration corresponds to the result without applying the iterative subroutine, and only the iteration with the lowest trajectory RMSE (e.g., out of the first 100 iterations) is selected as the starting point of the least square estimation subroutine.
Panel A depicts the trajectory RMSE values across 100 iterations of the iterative subroutine in the iLV Algorithm, illustrating its ability to iteratively identify parameter estimations with the lowest RMSE within the first 100 runs. The last three panels compare the predicted relative abundances (solid lines) to observed values (symbols) for each species, with both Subroutines (Panel B), without Subroutine 1 (Panel C), and without Subroutine 2 (Panel D), respectively, demonstrating significantly improved alignment after joint effects of both Subroutines in the iLV Algorithm.
https://doi.org/10.1371/journal.pcbi.1013691.g002
Subroutine 2 helps to improve parameter estimation as leastsq() and least_squares() functions find a local minimum of the cost function F(x), near the initial guess returned by Subroutine 1. As shown in Fig 2D, the trajectory RMSE was 0.169 when only Subroutine 1 was applied without Subroutine 2.
Benchmarking with existing methods
Previously, Joseph et al. [13] developed a nonlinear dynamical system for modeling microbial dynamics using relative abundances, termed the “compositional” Lotka-Volterra (cLV) model, which unifies methodologies based on generalized Lotka-Volterra (gLV) equations with approaches from compositional data analysis. However, cLV cannot fully recover the original interaction coefficients in the generalized Lotka-Volterra model and it also assumes stays roughly constant [13]. There could also be other ways of dealing with relative abundance input, such as applying the generalized Lotka–Volterra (gLV) to relative abundance input directly (we referred to it as gLV_relative). Besides, we used Pearson or Spearman correlation coefficients between the relative abundance data of two microbial species i and j to assess the corresponding interaction coefficients of the generalized Lotka–Volterra model, as two other benchmarked methods. We also applied the gLV model to absolute abundance data for parameter estimation as a baseline for comparison.
To evaluate the performance of these methods, we generated simulated data using the generalized Lotka–Volterra model, represented by the ODE system (4), under two biologically relevant parameter settings without self-interactions. The first parameter setting ( , , , 0, , , , , , , , , , ) reflects periodic oscillations among three species, which are similar to patterns observed in the lynx-hare dynamics [14]. The second parameter setting ( , , , 0, , , , , , , , , , ) describes a scenario where one species eventually dominates while others decline, as observed in the cheese microbial dataset [15].
To test robustness, we added varying levels of noise (no noise, 5%, or 10% random Gaussian noise) to the simulated data. Relative abundance data were generated for different time lengths and intervals. For each noisy group (either 5% or 10% noise), 20 randomly noisy datasets of relative abundance were generated. iLV was used to estimate the interaction coefficients with . The cosine similarities between the estimated and ground truth interaction coefficients ( ) are presented in Tables 1, 2, 4, and 5. For the cLV model, cosine similarities were calculated between the estimated and ground truth values of (where species D has the lowest observed variance of relative abundance among all species and i ≠ j or D), due to the inherent limitations of the cLV framework of not being able to estimate [13]. Trajectory RMSE was not used as an evaluation metric since the system can be unstable growing exponentially if parameter estimation is too far from the ground truth. Moreover, cLV cannot fully recover either.
https://doi.org/10.1371/journal.pcbi.1013691.t001
https://doi.org/10.1371/journal.pcbi.1013691.t002
We also added different levels of self-interaction terms in simulated datasets. Positive self-interactions are rarely biologically meaningful, which usually lead to unstable dynamics [16,17]. Therefore, was randomly assigned from three negative ranges representing different magnitudes of self-regulations: -0.005 to 0 (low), -0.01 to -0.005 (medium), and -0.015 to -0.01 (high). The metric of cosine similarity remains the same as in Tables 1, 2, 4, and 5, and results are presented in Tables 3 and 6 for periodic-oscillation and stabilizing systems, respectively. We incorporated self-interaction terms ( ) into gLV_absolute modeling for results presented in Tables 3 and 6. However, this was not feasible for iLV, cLV, or gLV_relative due to the identifiability issue mentioned in the Methods section. While gLV_absolute generally achieved the best performance using absolute-abundance inputs, iLV consistently outperformed cLV and gLV_relative using relative-abundance inputs. Under all levels of self-interactions, iLV exhibited stable and reliable performance in capturing the relative interaction strengths among species, with cosine similarity consistently exceeding 0.7. By comparison, both cLV and gLV_relative yielded markedly lower cosine similarities (Table 4).
https://doi.org/10.1371/journal.pcbi.1013691.t003
https://doi.org/10.1371/journal.pcbi.1013691.t004
https://doi.org/10.1371/journal.pcbi.1013691.t005
https://doi.org/10.1371/journal.pcbi.1013691.t006
Across all scenarios, iLV performed well against competing methods in recovering interaction coefficients (bij). In noise-free conditions, iLV achieved perfect or near-perfect cosine similarities between predicted and ground-truth coefficients, demonstrating its accuracy and reliability. While gLV-relative and cLV methods are limited by their reliance on linear approximations and the assumption that stays constant, iLV leverages iterative optimization to refine parameter estimates, resulting in significantly improved accuracy under most of the scenarios. Moreover, the Pearson and Spearman coefficients are generally very small in the four tables, indicating that they cannot be used to measure the interaction strength between different species.
The benchmarking results revealed that iLV’s performance is sensitive to temporal resolution, with smaller time intervals (Δt) yielding more accurate estimates. Longer temporal ranges had similar effects on iLV performance but with some exceptions (such as the second parameter setting when t = 20). In contrast, the performance of cLV and gLV-relative methods plateaued or declined under these conditions, underscoring iLV’s ability to exploit temporal information effectively. As noise levels increased, iLV showed higher variance compared to other simple linear approximation methods, since higher noise levels can make the non-linear optimization process harder by further complicating the non-convex objective function.
When comparing methods across all benchmark scenarios, gLV-absolute performed well with absolute abundance data, but its reliance on such input limits its applicability to microbiome studies, where relative abundance is often the only available measure. In contrast, iLV provides comparable accuracy while using relative abundance, demonstrating its versatility and practical relevance.
The snowshoe hare-Canadian lynx population dynamics and the Stylonychia pustula-P. caudatum mixed culture
In the northern boreal forests of North America, the snowshoe hare (Lepus americanus) serves as the primary food source for the Canadian lynx (Lynx canadensis). When hare populations are abundant, lynx consume approximately two hares every three days, primarily excluding other prey from their diet. As a result, the population dynamics of these two species are tightly interconnected. For over a century, it has been observed that their populations underwent significant fluctuations in cycles lasting approximately 8–11 years [14].
We utilized the absolute abundance data of lynx and hare from 1920 to 1935 [18], converting it into relative abundances. To analyze the data, we applied the iLV and gLV_relative models to the relative abundance dataset and gLV_absolute model to the absolute abundance data. The resulting trajectory RMSEs were 0.109, 0.318, and 0.425, respectively, indicating that iLV achieved the lowest RMSE and provided the best agreement between estimated and observed trajectories, as shown in Fig 3A, 3B, and 3C.
We set according to Table 7. Panels A, B, and C show the trajectories generated by iLV, gLV_relative, and gLV_absolute models using all 16 pairs of data points, respectively, with observed data overlaid for comparison. iLV aligns closely with observed trajectories (RMSE = 0.109), whereas gLV_relative (RMSE = 0.318) and gLV_absolute (RMSE = 0.425) display higher errors. Panel D illustrates the training and validation RMSE distribution across 100 runs, showing iLV significantly outperforming the other methods in training and validation accuracies. Significance is computed relative using the Friedman Test followed by a post-hoc one-sided Wilcoxon signed-rank test (****: p < ; ***:p < 0.001; **: p < 0.01; *: p < 0.05; ns: not significant). https://doi.org/10.1371/journal.pcbi.1013691.g003 To further evaluate model performance, we randomly split the relative abundance dataset into 10 pairs (hare and lynx) of data points for parameter training and 6 pairs of data points for validation across 100 runs. The performance of iLV, gLV_relative, and gLV_absolute was compared using training and validation trajectory RMSE as shown in Fig 3D. The cLV model was excluded from this analysis, as it can only estimate and , where species D has the lowest observed variance of relative abundance among all species. Without explicit estimates for and , the trajectories of relative abundances cannot be recovered, making trajectory RMSE computation infeasible for cLV. The iLV model significantly outperformed gLV_relative and gLV_absolute in training RMSE, with p-values and , respectively (Friedman Test followed by a post-hoc one-sided Wilcoxon signed-rank test). No significant difference in training RMSE was observed between gLV_relative and gLV_absolute. For validation RMSE, iLV significantly outperformed gLV_relative and gLV_absolute, with p-values of and , respectively. No significant difference in validation RMSE was detected between gLV_relative and gLV_absolute. Overall, the iLV model consistently performed best among the three methods during training and validation, which underscores iLV’s reliability. A dataset containing the abundances of Stylonychia pustulata grown in mixture with P. caudatum on Osterhout medium was recorded in The Struggle for Existence [19] and digitized by Mühlbauer et al. [20]. Absolute abundances of these two species were recorded in a 25-day period, and we converted them to relative abundances. Day 2 was skipped, while individual values were missing and only the mean was reported for days 20–25 in the original data. Besides, the system started to plateau after day 19, so we modeled mean abundances (of three replicates) before day 19. The fitted results are given in Fig 4. We compared the performances of iLV, gLV_relative, and gLV_absolute, and iLV had the lowest RMSE (0.0673) compared to the other two models (with RMSE of 0.163 and 0.383, respectively). iLV also captured the system oscillation between day 0 and day 7, while the other two models could not. We set according to Table 7. Panels A, B, and C show the trajectories generated by iLV, gLV_relative, and gLV_absolute models using all 19 pairs of data points, respectively, with observed data overlaid for comparison. iLV aligns closely with observed trajectories (RMSE = 0.0673), whereas gLV_relative (RMSE = 0.163) and gLV_absolute (RMSE = 0.383) display higher errors. Panel D illustrates the training and validation RMSE distribution across 100 runs, showing iLV significantly outperforming the other methods in training accuracy and maintainin significantly better validation performance than gLV_absolute. Significance is computed relative using the Friedman Test followed by a post-hoc one-sided Wilcoxon signed-rank test (****: p < ; ***:p < 0.001; **: p < 0.01; *: p < 0.05; ns: not significant). https://doi.org/10.1371/journal.pcbi.1013691.g004 Similarly, we randomly split the whole dataset into 12 pairs of data points for training and 7 pairs of data points for validation across 100 runs, and the results are shown in Fig 4D. iLV performed significantly better than gLV_relative and gLV_absolute in training RMSE. As for validation, iLV performed significantly better than gLV_absolute, but there was no significant difference between iLV and gLV_relative in validation RMSE. A cheese microbial community Mounier et al. previously quantified cell counts of five microbial groups within a cheese microbial community over a 21-day period [15]. We converted these absolute counts to relative abundances for modeling purposes. These groups include D. hansenii, Y. lipolytica, G. candidum, Leucobacter sp., and a bacterial group composed of Arthrobacter arilaitensis, Hafnia alvei, Corynebacterium casei, Brevibacterium aurantiacum, and Staphylococcus xylosus. The system plateaued around day 10, and we modeled the dynamic phase between days 0 and 10. We applied our iLV model to this dataset with set to 0.001 (see Table 7), and the trajectory RMSE is 0.149. For comparison, we also applied the gLV_relative and gLV_absolute models, which yielded trajectory RMSEs of 0.327 and 0.296, respectively. Runtime and sensitivity to of the iLV Algorithm using real datasets. https://doi.org/10.1371/journal.pcbi.1013691.t007 Significant interactions among the microbial species, as predicted by the iLV model, are shown in Fig 5D. Some of these interactions are supported by previous studies [15]. For example, Y. lipolytica has been shown to reduce the viability of D. hansenii during the stationary phase [15]. However, not all predicted interactions align with prior findings. For instance, Y. lipolytica was reported to inhibit the mycelial growth of G. candidum, rather than promoting it. This inhibition disrupted the typical mold-like mycelial structure of G. candidum, leading to the formation of spaghetti-like morphologies [15]. It is important to note, however, that Mounier et al. investigated pairwise interactions through two-species co-culture experiments [15], which may not fully capture the complexity of interactions within a multispecies community. We set according to Table 7. Panels A, B, and C show the trajectories generated by iLV, gLV_relative, and gLV_absolute models, respectively, with observed data overlaid for comparison. iLV aligns closely with observed trajectories (RMSE = 0.149), whereas gLV_relative (RMSE = 0.327) and gLV_absolute (RMSE = 0.296) display higher errors. Panel D shows the predicted interaction network. The green arrows mean promotion, and the red arrows mean inhibition. The thickness of the arrows is proportional to the effect size of promotion or inhibition, indicated by the magnitude of bij. Only interactions that are at least 50 times larger than the minimum predicted interaction were plotted. Dh is D. hansenii, Yl is Y. lipolytica, Gc is G. candidum, Ls is Leucobacter sp., and C is a bacterial group composed of Arthrobacter arilaitensis, Hafnia alvei, Corynebacterium casei, Brevibacterium aurantiacum, and Staphylococcus xylosus. https://doi.org/10.1371/journal.pcbi.1013691.g005 Runtime and sensitivity to of the iLV Algorithm We tested the runtime of the iLV Algorithm in both real and simulated datasets under different values of (varying from 0.001 to 1000), and the results are shown in Tables 5 and 6, respectively. In the lynx-hare dataset, trajectory RMSE had minor fluctuations around 0.11 and achieved the minimum of 0.109 when . Similarly, trajectory RMSE stayed rather stable around 0.068 and achieved the minimum of 0.0673 when in the Stylonychia pustula-P. caudatum dataset. The cheese microbial dataset exhibited slightly higher fluctuations and achieved the minimum RMSE of 0.149 when . For the simulated datasets, generated using the two parameter settings applied in benchmarking with 10% Gaussian noise added, the trajectory RMSE remained consistent across different values of . Based on these results, we adopted in benchmarking analyses without loss of generality. Overall, the trajectory RMSE is not highly sensitive to the value of for small systems, but we noticed a small increase in sensitivity as the system becomes larger and more complicated. The iLV model’s computational bottleneck lies in the nonlinear optimization step (Subroutine 2), which uses leastsq() and least_squares() functions from the SciPy library [21]. The time complexities of the Levenberg–Marquardt algorithm (implemented in both leastsq() and least_squares(method = ’lm’)) and the Trust Region Reflective algorithm (least_squares(method = ’trf’)), both scale approximately with O( ) when the Jacobian of cost function F(x) is analytically solvable, where n is the number of parameters, m is the number of timepoints, and k is the number of iterations until convergence (depending on the problem, starting point, etc.) [22,23]. In our experiments, the runtime was acceptable for low-dimensional systems (see Tables 5 and 6). It generally increases with the number of species and the number of time points, while can also affect runtime by providing different starting points and thus affecting the number of iterations required for convergence. Discussion This study presents the iterative Lotka-Volterra (iLV) model as a novel approach for parameter estimation using relative abundance data, which integrates iterative refinement with non-linear optimization. The iLV model outperformed existing methodologies, such as the compositional Lotka-Volterra (cLV) model and generalized Lotka-Volterra (gLV_relative) approaches, in recovering interaction coefficients of simulation data under various noise levels and temporal resolutions. Moreover, compared with cLV and gLV_relative, iLV demonstrated more stable and reliable performance in capturing the relative interaction strengths across species under different magnitudes of self-interactions. Real-world applications, such as Canadian lynx-snowshoe hare, Stylonychia pustula-P. caudatum, and the cheese microbial community datasets, further validated the practicality of iLV. In these cases, iLV successfully reconstructed system trajectory with low RMSE values, demonstrating its ability to handle compositional data without sacrificing model interpretability. For instance, in the lynx-hare system, iLV accurately captured population dynamics, aligning closely with observed data (RMSE = 0.109). The iterative design of iLV is a significant improvement over traditional linear regression methods. By iteratively solving ordinary differential equations and refining parameter estimates, iLV minimizes the root mean square error (RMSE) between observed and predicted relative abundances. Despite its effectiveness, the iLV model’s computational bottleneck lies in the nonlinear optimization step (Subroutine 2), which uses leastsq() and least_squares() functions from the SciPy library [21]. These functions can be computationally intensive, particularly for systems with many species or dense time-series data. In our experiments, the runtime was acceptable for low-dimensional systems (see Tables 5 and 6), but challenges will occur for higher-dimensional systems. To address scalability concerns, several strategies can be considered in future work. One is the parallelization of key components, such as distributed optimization across multiple initializations [24]. Furthermore, using sparse Jacobians or approximate Hessians could potentially reduce computation without a major loss in accuracy, as has been shown in large-scale numerical optimization problems [25,26]. These approaches would enhance the scalability of iLV and facilitate its application to larger and more complex microbiome datasets, where compositional data with many taxa and high temporal resolution are increasingly common [10,11]. Another solution would be dimensionality reduction. For example, Phuongan et al. reduced more than 11,000 microbial Operational Taxonomic Units (OTUs) into 14 subgroups by clustering their blooming times, and they modeled on the subgroups instead of OTUs [27]. Besides, incorporating additional ecological constraints, such as carrying capacities or environmental perturbations, could improve the realism of the model. Moreover, an initial guess of the inter-species sum of absolute abundances is needed for iLV, and users may tweak it several times for better prediction accuracy. Sensitivity of is dataset-specific, and datasets involving larger and more complicated systems are more likely to be sensitive to in our experiments (see Tables 7 and 8). The number of data points also matters for iLV performance, and data interpolation may help with inadequate input data while also introducing additional errors. https://doi.org/10.1371/journal.pcbi.1013691.t008 In summary, the iterative Lotka-Volterra model offers a robust framework for analyzing compositional data. By addressing limitations in existing methodologies and demonstrating superior performance across simulated and real-world datasets, iLV sets a new approach for species interaction modeling. Future efforts to expand data dimensionality and ecological assumptions (such as perturbations) will further cement its utility in ecological and microbiome research. Methods The generalized Lotka-Volterra model, defined with relative abundances The Lotka-Volterra model was first proposed by Alfred James Lotka in 1910 for chemical reactions [28], and in 1925 he used the model to study predator-prey interactions [4]. In 1926, the same model was published by Vito Volterra [29]. While the Lotka-Volterra model only depicts systems of two species, the generalized Lotka-Volterra (gLV) was proposed to model systems of more than two species. The gLV model is widely used in describing the dynamics of a set of organisms, which takes the form of (where N1, …, Nm are absolute abundances of the different organisms in the system, and ri and bij are coefficients): (1) Relative abundances (x1, …, xm) and absolute abundances can be converted mutually if the sum of absolute abundances of all species in the system is given: (2)(3) According to equations (1)(2)(3), the gLV model can be rewritten with x1, …, xm and Nsum: (4) where the first sub-equation in (4) holds as the derivative of the sum equals the sum of derivatives and the rest of equations (4) holds as the following: We used odeint() function in scipy package [21] to solve the ordinary differential equation system (4) numerically when the initial values of x1, …, xm and Nsum are given. Parameter estimation by linear approximation and regression Multiple investigators have used the following linear approximations for the generalized Lotka-Volterra model when absolute abundances are given [6,8,15,30]: (5) where is the difference of at two adjacent time points and , and By linear approximation (5), is the intercept and are slopes of the linear regression where is the dependent variable and are independent variables. When absolute abundance data are available, parameters can be estimated as the least square solutions of such linear regressions. In benchmarking, we referred to such method as gLV_absolute. Since absolute abundances and relative abundances can be mutually converted when the sum of absolute abundances is known, we were inspired by the linear approximation method and came up with a novel iterative method for estimating parameters using the relative abundance presentation of the generalized Lotka-Volterra model (as indicated by equation system (4)). The iterative Lotka-Volterra (iLV) algorithm The iterative Lotka-Volterra (iLV) algorithm has two subroutines: (1) The iterative subroutine, which utilizes linear optimization, and (2) The least square estimation subroutine, which is a non-linear optimization using the results from the iterative subroutine as the starting point of optimization. When relative abundances of the i-th organism at different time points are given and we have an initial guess of , we propose the following iterative subroutine to improve parameter estimation to minimize the root-mean-square (RMSE) between estimated and observed relative abundances: https://doi.org/10.1371/journal.pcbi.1013691.t009 The motivations behind the iterative algorithm can be described as follows. The first step gives an initial estimate of the parameters ( , assuming constant total absolute abundance across time. The assumption of constant total absolute abundance across time is not correct in most situations and the total absolute abundance calculated based on the estimated parameters using the gLV model may be closer to the real values. Therefore, we re-estimate the parameters using the estimated total absolute abundance together with the observed relative abundance, resulting in a more accurate estimation of absolute abundances. We iterate this process multiple times and choose the parameters as the set having the lowest RMSE between the theoretical relative abundance and observed relative abundance. Note that if we multiply by a constant such that , the resulting estimate will not change, while the resulting interaction parameters will be changed to from equation (5) in the first step. Therefore, the estimated interaction coefficients will differ by a constant which will not impact the relative abundance across time [12]. We note that the iterative subroutine was based on intuition and intended to obtain an initial estimate of the parameters ( , , which can be used as the starting point for the following least square estimation. There is no guarantee that the estimates from the iterative subroutine converge to the real parameter values. Subroutine 2: the least square estimation subroutine We used least_squares() and leastsq() functions in scipy package [21] to find a local minimum of the cost function F(x), given initial guesses of the parameters: Given different initial guesses of the parameters, least_squares() and leastsq() functions may return different estimated parameters corresponding to different local minima of the cost function F(x). Therefore, a close-enough initial guess of parameters is crucial. We used the results returned by Subroutine 1 as the initial guesses for , , observed relative abundances at time point 0 as the initial guesses for . There are three optimization methods implemented in least_squares() function for finding a nearby local minimum, which are Trust Region Reflective (TRF) algorithm [22], Levenberg-Marquardt (LM) algorithm [23] and dogbox algorithm [31]. The dogbox algorithm is suitable for a small number of variables and may exhibit slow convergence when the rank of Jacobian is less than the number of variables [31], and it had poor performance and stability in our trials, so we only used Trust Region Reflective and Levenberg-Marquardt optimizations. We also included leastsq() function that has a slightly different implementation of Levenberg-Marquardt (LM) algorithm, as compared to least_squares() function. In our data analysis, we compared the trajectory RMSE of least_squares(method = “trf” or method = “lm”) and leastsq(), and returned the results with the lowest trajectory RMSE. Notably, while the growth coefficients in the gLV model are identifiable using relative abundance data, interaction coefficients are only identifiable up to a multiplicative constant [12]. Consequently, the interaction coefficients estimated by the iLV algorithm approximate the true interaction coefficients up to a constant. This property allows us to use the estimated parameters to compare the relative strengths of microbial interactions. Besides, for better performance of the iLV Algorithm, users have to tune that returned a local minimum of trajectory RMSE. Sensitivity to is dataset-specific, and datasets involving larger and more complicated systems are more likely to be sensitive to in our experiments (see Tables 5 and 6). We suggest that investigators carefully compare the trajectories of the relative abundances of the different microbes based on the gLV model under the estimated parameters with the observed relative abundances of the microbes. Their close approximation suggests that the estimated interactions are close to the true underlying interactions. Identifiability issue of self-interaction terms At each time point, only the relative abundances of the species are available, which satisfy the compositional constraint . Thus, the relative abundance of the first species can be represented as . This expression was substituted into the first step of Subroutine 1 in the iLV algorithm to obtain: (6) In equation (6), while and are identifiable, there are infinitely many solutions for and . Similar identifiability issues also arise in the cLV and gLV_relative models [13]. To overcome this issue, we assumed self-interaction terms to be 0 when modeling relative abundances with iLV, cLV, or gLV_relative. We also investigated how the presence of self-interactions impacted the estimation of (where i ≠ j) in our simulation studies. Evaluation metrics We used trajectory Root Mean Square Error (RMSE) to measure the performance of our algorithm (where m is the number of microbial species in the system and n + 1 is the number of sampled time points): Cosine similarities of bij between estimated and ground truth values were also used as an evaluation metric in benchmarking. We concatenated the estimated values of all bij (where i = 1,2,..,m, and j ≠ i) as vector A, and we concatenated the ground truth value of all bij vector B. Then, cosine similarity is the cosine of the angle between vector A and vector B.

webintern@dakdan.com