FORGE: A Hybrid Framework Bridging Foundational and Non-Foundational Models for Forecasting
How Combining Foundational and Non-Foundational Models Can Take Forecasting Further
This article was co-authored with Rafael Guedes and Sebastião Lessa.
Introduction
Foundation models for time series forecasting have been one of the hot topics in AI this year. A foundation model is a large-scale, general-purpose model that requires no task-specific training or data (we can fine-tune it, but we don't need to). Thus, we can apply a foundation model in a zero-shot inference setup: given a small chunk of contextual information, for example, a short window of a new time series, we can predict an entire forecasting horizon. Notice that this process involves no training, only inference.
Several companies have entered this race with competitive foundation models. Although they share the goal of generating accurate forecasts without any training, these models were developed with different strategies and characteristics. TimeGPT [1], developed by Nixtla, was one of the first foundation models for forecasting. It was followed by MOIRAI [2] from Salesforce; Lag-Llama [3] from Morgan Stanley, ServiceNow, and a group of Canadian universities; TimesFM [4] from Google; and Chronos [5] from Amazon.
These models differ in architecture and capabilities. TimeGPT and MOIRAI, for instance, are multivariate and can use external information to generate predictions. In most setups, both capabilities matter: you can model dependencies between time series and incorporate relevant features that help explain the target variable's behavior. For example, in a retail context, information about marketing activities can be crucial to understanding the sales patterns of a specific product. Chronos and TimesFM, on the other hand, are strictly univariate foundation models and cannot use external information to produce better forecasts.
In this article, we propose a hybrid framework called FORGE (Fusion of Refining and Generalized Engines) that addresses this limitation by combining Chronos with a multivariate non-foundational model. The goal is to first create a base forecast using Chronos — taking advantage of its strong zero-shot inference performance — and then refine it with a multivariate model that can incorporate external information and capture dependencies across series. The article covers all the theoretical details, as well as the practical implementation of the FORGE approach.
As always, the code is available on our GitHub.
Chronos
Chronos [5] is Amazon's most recent foundation model for time series forecasting: a probabilistic model built on the T5 (Text-to-Text Transfer Transformer) architecture [6]. T5 is a Large Language Model (LLM) designed for Natural Language Processing (NLP) tasks such as classification, summarization, and text generation. So how can it be adapted to handle continuous numerical data and generate accurate forecasts?
As illustrated in Figure 2, the forecasting team at Amazon managed to adapt any LLM into a forecasting model by focusing on two main concepts:
- Scaling maps the input data into a range of values suitable for the quantization step. Unlike its usual role of easing the optimization of deep learning models, here scaling prepares the values to be turned into input tokens, the format expected by any LLM.
- Quantization then converts the scaled continuous values into a vocabulary of 4,096 discrete tokens through binning. The authors used uniform binning, which maps all values within a specific range to the same bin and, therefore, to the same token (a minimal sketch of both steps follows this list).
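To make these two steps concrete, here is a minimal sketch in Python. The exact scaling statistic, bin limits, and special tokens Chronos uses are described in the paper [5]; the mean-absolute scaling and the [-15, 15] bin range below are assumptions made purely for illustration.

```python
import numpy as np

def scale(context: np.ndarray) -> tuple[np.ndarray, float]:
    # Scale by the mean absolute value of the context window
    # (assumed statistic; see the Chronos paper for the exact scheme).
    s = float(np.abs(context).mean()) or 1.0
    return context / s, s

def quantize(scaled: np.ndarray, n_bins: int = 4096,
             low: float = -15.0, high: float = 15.0) -> np.ndarray:
    # Uniform binning: equally spaced edges between assumed limits map
    # every scaled value to one of n_bins token ids.
    edges = np.linspace(low, high, n_bins - 1)
    return np.digitize(scaled, edges)

context = np.array([120.0, 135.0, 150.0, 160.0])
scaled, s = scale(context)
tokens = quantize(scaled)  # discrete ids an LLM can consume
```

At inference time the process runs in reverse: the model samples token ids, which are mapped back to bin centers and multiplied by the stored scale to recover forecasts in the original units.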
Another key feature of Chronos is its zero-shot inference capability. The model generates accurate zero-shot forecasts thanks to intensive training on roughly 796,000 series across domains such as energy, transport, healthcare, retail, web, weather, and finance.
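To see the zero-shot capability in practice, here is a minimal example using the open-source chronos-forecasting package. The checkpoint, device, and synthetic context are illustrative choices, not the settings we use later in the article.

```python
import numpy as np
import torch
from chronos import ChronosPipeline

# Load a pretrained checkpoint (the small variant keeps the example light).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

# Any 1-D history works as context; no training or fine-tuning involved.
context = torch.tensor(np.sin(np.arange(36) / 3), dtype=torch.float32)
samples = pipeline.predict(context, prediction_length=6)  # (1, num_samples, 6)

# Summarize the sampled paths into quantile forecasts.
low, mid, high = np.quantile(samples[0].numpy(), [0.25, 0.5, 0.75], axis=0)
```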
Nevertheless, the model also has some disadvantages, such as:
- It is a univariate model, which makes it impossible to model dependencies between series.
- It cannot process external information such as static and dynamic covariates, for example, product brand or product price.
- It cannot forecast beyond the range of values seen in training: since it is trained with a cross-entropy loss over a fixed token vocabulary, it can only predict values its bins cover, so it struggles with series that have aggressive upward or downward trends (see the short snippet after this list).
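This last limitation can be seen directly in the quantization sketch from the section above: any scaled value that falls outside the (assumed) bin range has no token of its own and collapses into an edge bin.

```python
# Continuing the quantize() sketch above: values beyond the bin limits
# all land in the edge bins, so a trend that leaves the range the model
# can represent simply cannot be expressed.
extreme = np.array([-20.0, 20.0])  # outside the assumed [-15, 15] range
print(quantize(extreme))  # both values are clamped to the edge bins
```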
TiDE
TiDE, or Time-series Dense Encoder, was developed by researchers at Google to tackle long-term forecasting with an MLP encoder-decoder model that surpasses far more complex Transformer architectures [7]. The model efficiently handles both static and dynamic covariates, like product brand and price, making it well suited to industries such as transportation, energy, and retail.
TiDE’s architecture, shown in Figure 3, begins with a feature projection layer that reduces the dimensionality of the dynamic covariates. A dense encoder then processes this projected data, along with static covariates and historical inputs, to form a dense feature embedding. This embedding is passed to the dense decoder to produce initial predictions, which the temporal decoder refines by integrating temporal patterns and dependencies. The temporal decoder also uses a residual connection, leveraging projections from past data to further improve forecast accuracy.
Though TiDE is simple and powerful, it is sensitive to its hyperparameters and requires careful tuning to handle complex real-world datasets effectively.
FORGE: combining foundational and non-foundational models
As noted earlier, Chronos can perform zero-shot forecasts. Nonetheless, it has three main limitations: an inability to model dependencies between series, no flexibility for incorporating external (and potentially critical) information, and a constraint to producing predictions only within the training range. To address these issues, we propose a hybrid model that combines Chronos’s forecasts with the capabilities of any multivariate model that can handle external data, capture series dependencies, and forecast beyond the values seen in training.
FORGE has two components:
- Base forecast is the forecast produced by Chronos based on historical time series data.
- Residuals forecast is the forecast produced by a multivariate model, in our case TiDE, trained globally (across all time series) on the historical residuals of Chronos together with static and dynamic covariates.
To train TiDE, we add a new target variable called residuals to the original dataset: the difference between the actual values and the forecast produced by Chronos. Defined this way, the residual is exactly the correction that should be added back to the Chronos forecast. We use an expanding window to generate this new target variable, as shown in Figure 4.
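As a rough sketch, the expanding-window construction for a single series could look like the function below. We assume a dataframe with Date and visits columns, an ordered list of cutoff dates, a loaded ChronosPipeline, and the utils.chronos_forecast helper used in the implementation further down; all of these names are placeholders for illustration.

```python
import numpy as np
import pandas as pd
from chronos import ChronosPipeline
import utils  # the helper module used throughout this article

def build_residuals(series_df: pd.DataFrame, cutoffs: list,
                    pipeline: ChronosPipeline, horizon: int = 6) -> pd.DataFrame:
    rows = []
    for cutoff in cutoffs:  # expanding window: each cutoff sees more history
        history = series_df[series_df["Date"] <= cutoff]
        # median Chronos forecast for the next `horizon` months
        _, mid, _ = utils.chronos_forecast(pipeline, history, horizon, "visits")
        actuals = series_df[series_df["Date"] > cutoff].head(horizon)
        # residual = actual - prediction, so a predicted residual can later
        # be added back to the base forecast as a correction
        rows.append(actuals.assign(residuals=actuals["visits"].values - np.asarray(mid)))
    return pd.concat(rows)
```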
As shown in Figure 5, once the residuals have been produced, we can generate the base forecast with Chronos and the residuals forecast with TiDE. The two forecasts are then summed to create the final, enhanced forecast, which draws on historical data, dynamic and static covariates, and dependencies between series.
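The combination itself is just an aligned sum of the two forecast dataframes, which is what utils.combine_predictions does for us later. A minimal sketch, with assumed names for the id, date, and quantile columns:

```python
import pandas as pd

def combine_predictions(base: pd.DataFrame, res: pd.DataFrame) -> pd.DataFrame:
    # Align the Chronos base forecast and the TiDE residuals forecast on
    # series id and date, then add the residuals correction to each quantile.
    merged = base.merge(res, on=["unique_id", "Date"], suffixes=("", "_res"))
    for col in ["forecast_lower", "forecast", "forecast_upper"]:  # assumed names
        merged[col] = merged[col] + merged[f"{col}_res"]
    return merged[base.columns.tolist()]
```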
FORGE in action
In this section, we will use FORGE to forecast tourism visitors to Australia using a real-world dataset that is publicly available under the cc-by-4.0 license. Subsequently, we compare the forecasting performance of FORGE against Chronos and TiDE individually.
We enhanced the dataset with economic covariates (e.g., CPI, Inflation Rate, GDP) extracted from Trading Economics, which uses economic indicators based on official sources. We also perform some preprocessing to increase the usability of the dataset further. The final structure of the dataset is the following:
- Unique ID: A combination of encoded names for States, Zones, Regions within Australia, and the purpose of the visit (e.g., business, holiday, visiting, other).
- Time: Represents the time dimension of the dataset, dynamically adjusted for each series.
- Target: The target variable for forecasting, specifically focusing on visits.
- Dynamic Covariates: Economic indicators such as CPI, Inflation Rate, and GDP that vary over time.
- Static Covariates (Static_1 to Static_4): Extracted from the unique ID, these provide additional information for analysis, including geographic and purpose-of-visit details.
We stored the new dataset version here so that our experiments can be easily reproduced.
In addition to the final dataset, the residuals dataset was also created to train our multivariate model, TiDE. It has the same structure, but instead of the target, it contains the residuals: the difference between the actual target and the target predicted by Chronos.
Moving to the practical implementation, we start by importing the libraries and setting the global variables:
```python
from darts.dataprocessing.transformers import StaticCovariatesTransformer
from darts.utils.likelihood_models import QuantileRegression
from darts.dataprocessing.transformers import Scaler
from darts.dataprocessing.pipeline import Pipeline
from dateutil.relativedelta import relativedelta
from chronos import ChronosPipeline
from darts.models import TiDEModel
from darts import TimeSeries
import pandas as pd
import torch

import utils

DATASET = "hf://datasets/zaai-ai/time_series_datasets/data.csv"
RESIDUALS_DATASET = "hf://datasets/zaai-ai/time_series_dataset_residuals/residuals.csv"
TIME_COL = "Date"
TARGET = "visits"
RESIDUALS_TARGET = "residuals"
STATIC_COV = ["static_1", "static_2", "static_3", "static_4"]
DYNAMIC_COV = ["CPI", "Inflation_Rate", "GDP"]
FORECAST_HORIZON = 6  # months
FREQ = "MS"

# Chronos checkpoint and device used when loading the pipeline further below
# (the exact values here are our assumption; pick what fits your hardware).
CHRONOS_ARCHITECTURE = ("amazon/chronos-t5-large", "cpu")

# Shared transformers: one scaler for the covariates, plus a pipeline that
# scales the series and encodes the static covariates (assumed composition).
SCALER = Scaler()
PIPELINE = Pipeline([Scaler(), StaticCovariatesTransformer()])

# Tuned TiDE hyperparameters for forecast prediction
# RMSE: 134.74
TiDE_params = {
    'input_chunk_length': 12,
    'output_chunk_length': FORECAST_HORIZON,
    'num_encoder_layers': 4,
    'num_decoder_layers': 8,
    'decoder_output_dim': 8,
    'hidden_size': 8,
    'temporal_width_past': 16,
    'temporal_width_future': 8,
    'temporal_decoder_hidden': 128,
    'dropout': 0.2,
    'batch_size': 16,
    'n_epochs': 20,
    'likelihood': QuantileRegression(quantiles=[0.25, 0.5, 0.75]),
    'random_state': 42,
    'use_static_covariates': True,
    'optimizer_kwargs': {'lr': 0.001},
    'use_reversible_instance_norm': False,
}

# Tuned TiDE hyperparameters for residuals prediction
# RMSE: 133.85
res_TiDE_params = {
    'input_chunk_length': 11,
    'output_chunk_length': FORECAST_HORIZON,
    'num_encoder_layers': 5,
    'num_decoder_layers': 10,
    'decoder_output_dim': 15,
    'hidden_size': 5,
    'temporal_width_past': 4,
    'temporal_width_future': 10,
    'temporal_decoder_hidden': 20,
    'dropout': 0.15,
    'batch_size': 128,
    'n_epochs': 30,
    'likelihood': QuantileRegression(quantiles=[0.25, 0.5, 0.75]),
    'random_state': 42,
    'use_static_covariates': True,
    'optimizer_kwargs': {'lr': 0.001},
    'use_reversible_instance_norm': True,
}
```
After that, we load our datasets:
```python
df = pd.read_csv(DATASET).drop(columns=["Unnamed: 0"])
df[TIME_COL] = pd.to_datetime(df[TIME_COL])

residuals = pd.read_csv(RESIDUALS_DATASET)
residuals[TIME_COL] = pd.to_datetime(residuals[TIME_COL])

print(f"Distinct number of time series: {len(df['unique_id'].unique())}")
```
```
Distinct number of time series: 304
```
To better test FORGE, we created four folds. Each fold's training set contains all data up to the fold's start date, while the testing set comprises the subsequent 6 months of data. The last fold's testing set ends on the most recent date in the dataset.
These folds provide a more robust measure of model performance over time, since we are not relying on a single window to evaluate our approach.
Here is how this was implemented:
```python
# The folds are anchored to the most recent date in the dataset.
end_date = max(df[TIME_COL])

folds = []
for i in range(4, 0, -1):
    start_date = end_date - relativedelta(months=FORECAST_HORIZON * i)
    end_date_fold = start_date + relativedelta(months=FORECAST_HORIZON)
    fold = {
        "start_date": start_date,
        "end_date": end_date_fold,
        "train": None,
        "test": None,
        "df": None,
        "predictions": {
            "chronos": None,
            "tide": None,
            "hybrid": None,
        },
    }
    fold["train"] = df[df[TIME_COL] <= fold["start_date"]]
    fold["test"] = df[(df[TIME_COL] > fold["start_date"]) & (df[TIME_COL] <= fold["end_date"])]
    fold["df"] = df[df[TIME_COL] <= fold["end_date"]]
    folds.append(fold)

for i, fold in enumerate(folds):
    print(f"Fold {i+1}:")
    print(f"Months for training: {len(fold['train'][TIME_COL].unique())} "
          f"from {min(fold['train'][TIME_COL]).date()} to {max(fold['train'][TIME_COL]).date()}")
    print(f"Months for testing: {len(fold['test'][TIME_COL].unique())} "
          f"from {min(fold['test'][TIME_COL]).date()} to {max(fold['test'][TIME_COL]).date()}\n")
```
```
Fold 1:
Months for training: 204 from 1998-01-01 to 2014-12-01
Months for testing: 6 from 2015-01-01 to 2015-06-01

Fold 2:
Months for training: 210 from 1998-01-01 to 2015-06-01
Months for testing: 6 from 2015-07-01 to 2015-12-01

Fold 3:
Months for training: 216 from 1998-01-01 to 2015-12-01
Months for testing: 6 from 2016-01-01 to 2016-06-01

Fold 4:
Months for training: 222 from 1998-01-01 to 2016-06-01
Months for testing: 6 from 2016-07-01 to 2016-12-01
```
To implement FORGE, we first generate the base forecast for each fold using Chronos.
```python
for i, fold in enumerate(folds):
    # Build one darts TimeSeries per unique_id, grouped by the static covariates
    train_darts = TimeSeries.from_group_dataframe(
        df=fold['train'],
        group_cols=STATIC_COV,
        time_col=TIME_COL,
        value_cols=TARGET,
        freq=FREQ,
        fill_missing_dates=True,
        fillna_value=0)

    # Load the Chronos pipeline
    pipeline = ChronosPipeline.from_pretrained(
        CHRONOS_ARCHITECTURE[0],
        device_map=CHRONOS_ARCHITECTURE[1],
        torch_dtype=torch.bfloat16)

    forecast = []
    for ts in train_darts:
        # Forecast the next FORECAST_HORIZON months for this series
        lower, mid, upper = utils.chronos_forecast(pipeline, ts.pd_dataframe().reset_index(), FORECAST_HORIZON, TARGET)
        # Reconstruct the unique_id from the static covariates of the series
        unique_id = "".join(str(ts.static_covariates[key].item()) for key in ts.static_covariates)
        forecast.append(utils.convert_forecast_to_pandas([lower, mid, upper], fold['test'][fold['test']['unique_id'] == unique_id]))

    fold['predictions']['chronos'] = pd.concat(forecast)
```
After that, for each fold, we generate the residuals forecast using TiDE and add it to the Chronos forecast to produce the final forecast:
```python
for i, fold in enumerate(folds):
    residuals_train = residuals[residuals[TIME_COL] <= fold['start_date']]
    residuals_df = residuals[residuals[TIME_COL] <= fold['end_date']]

    residuals_darts = TimeSeries.from_group_dataframe(
        df=residuals_train,
        group_cols=STATIC_COV,
        time_col=TIME_COL,
        value_cols=RESIDUALS_TARGET,
        freq=FREQ,
        fill_missing_dates=True,
        fillna_value=0)

    dynamic_covariates = utils.create_dynamic_covariates(residuals_darts, residuals_df, FORECAST_HORIZON, DYNAMIC_COV)

    # scale covariates
    dynamic_covariates_transformed = SCALER.fit_transform(dynamic_covariates)

    # scale data and transform static covariates
    data_transformed = PIPELINE.fit_transform(residuals_darts)

    # train TiDE on the residuals and forecast them over the test window
    residuals_tide = TiDEModel(**res_TiDE_params)
    residuals_tide.fit(data_transformed, future_covariates=dynamic_covariates_transformed, verbose=False)
    pred = PIPELINE.inverse_transform(residuals_tide.predict(n=FORECAST_HORIZON, series=data_transformed, future_covariates=dynamic_covariates_transformed, num_samples=50))

    residuals_forecast = utils.transform_predictions_to_pandas(pred, RESIDUALS_TARGET, residuals_darts, [0.25, 0.5, 0.75], convert=False)

    # add the residuals forecast to the Chronos base forecast
    fold['predictions']['hybrid'] = utils.combine_predictions(fold['predictions']['chronos'], residuals_forecast)
```
Finally, we generate the forecast with TiDE on the original dataset, with visits as the target, to compare it against Chronos and FORGE.
```python
for i, fold in enumerate(folds):
    train_darts = TimeSeries.from_group_dataframe(
        df=fold['train'],
        group_cols=STATIC_COV,
        time_col=TIME_COL,
        value_cols=TARGET,
        freq=FREQ,
        fill_missing_dates=True,
        fillna_value=0)

    dynamic_covariates = utils.create_dynamic_covariates(train_darts, fold['df'], FORECAST_HORIZON, DYNAMIC_COV)

    # scale covariates
    dynamic_covariates_transformed = SCALER.fit_transform(dynamic_covariates)

    # scale data and transform static covariates
    data_transformed = PIPELINE.fit_transform(train_darts)

    tide = TiDEModel(**TiDE_params)
    tide.fit(data_transformed, future_covariates=dynamic_covariates_transformed, verbose=False)
    pred = PIPELINE.inverse_transform(tide.predict(n=FORECAST_HORIZON, series=data_transformed, future_covariates=dynamic_covariates_transformed, num_samples=50))

    fold['predictions']['tide'] = utils.transform_predictions_to_pandas(pred, TARGET, train_darts, [0.25, 0.5, 0.75])
```
Once all the predictions have been generated, we can compare the performance of FORGE against TiDE and Chronos across all folds. For each model, we calculate the MAPE for each month in the forecast horizon and average it across folds. We chose MAPE because it expresses the error as a percentage, which makes the relative magnitude of forecasting errors intuitive and easy to interpret.
For a fairer evaluation, we restricted the comparison to the 25 unique IDs with the highest volume of visits; series with very low counts can make the MAPE disproportionately high and distort the results.
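To make the evaluation concrete, the sketch below computes the MAPE per forecast month, averaged across the four folds, for each model. It assumes each predictions dataframe carries unique_id, Date, and a median forecast column named forecast; the real column names come from our utils helpers.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute percentage error, in percent.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Top 25 series by total visits, to avoid MAPE blow-ups on tiny counts.
top_ids = df.groupby("unique_id")[TARGET].sum().nlargest(25).index

scores = {"chronos": [], "tide": [], "hybrid": []}
for fold in folds:
    test = fold["test"][fold["test"]["unique_id"].isin(top_ids)]
    for model in scores:
        merged = test.merge(fold["predictions"][model],
                            on=["unique_id", TIME_COL], suffixes=("", "_pred"))
        # one MAPE value per month of the 6-month horizon for this fold
        per_month = merged.groupby(TIME_COL).apply(
            lambda g: mape(g[TARGET].values, g["forecast"].values))
        scores[model].append(per_month.values)

# average each horizon month across the four folds
avg_mape = {m: np.mean(np.vstack(v), axis=0) for m, v in scores.items()}
```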
Figure 6 shows that FORGE consistently outperforms both TiDE and Chronos in longer-term predictions, with the only exceptions occurring during the first two months. This pattern indicates that FORGE’s advantages become more pronounced over time. We argue that these benefits would be even greater for datasets with external features more closely correlated to the target variable (where TiDE excels) or for shorter, more erratic series (where Chronos excels).
Conclusion
In this article, we proposed FORGE, a framework that combines foundational and non-foundational time series forecasting models. We showed how FORGE can pair Chronos with TiDE, drawing on the strengths of each to improve our predictions.
FORGE showed promising results, outperforming the individual models in longer-term predictions. The results should be even more pronounced with datasets where external features correlate more with the target variable (where TiDE shines) or when the series are shorter and more erratic (where Chronos shines).
Overall, FORGE represents a step forward in combining foundational and non-foundational approaches, offering a more robust solution for real-world applications.
About me
Serial entrepreneur and leader in the AI space. I develop AI products for businesses and invest in AI-focused startups. Founder @ ZAAI | LinkedIn | X/Twitter
References
[1] Garza, A., & Mergenthaler-Canseco, M. (2023). TimeGPT-1. arXiv. https://arxiv.org/abs/2310.03589
[2] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., & Sahoo, D. (2024). Unified Training of Universal Time Series Forecasting Transformers. arXiv. https://arxiv.org/abs/2402.02592
[3] Rasul, K., Ashok, A., Williams, A. R., Ghonia, H., Bhagwatkar, R., Khorasani, A., Darvishi Bayazi, M. J., Adamopoulos, G., Riachi, R., Hassen, N., Biloš, M., Garg, S., Schneider, A., Chapados, N., Drouin, A., Zantedeschi, V., Nevmyvaka, Y., & Rish, I. (2024). Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. arXiv. https://arxiv.org/abs/2310.08278
[4] Das, A., Kong, W., Sen, R., & Zhou, Y. (2024). A decoder-only foundation model for time-series forecasting. arXiv. https://arxiv.org/abs/2310.10688
[5] Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Pineda Arango, S., Kapoor, S., Zschiegner, J., Maddix, D. C., Wang, H., Mahoney, M. W., Torkkola, K., Wilson, A. G., Bohlke-Schneider, M., & Wang, Y. (2024). Chronos: Learning the Language of Time Series. arXiv. https://arxiv.org/pdf/2403.07815
[6] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv. https://arxiv.org/abs/1910.10683
[7] Das, A., Kong, W., Leach, A., Mathur, S., Sen, R., & Yu, R. (2024). Long-term Forecasting with TiDE: Time-series Dense Encoder. arXiv. https://arxiv.org/pdf/2304.08424
All images are by the authors unless noted otherwise.