
Symbolic regression is a function discovery approach for the analysis and modeling of numeric multivariate data sets, with the aim of gaining insight into the data-generating system. As the name suggests, these insights are generated by means of regression.

Symbolic regression, or symbolic function identification, seeks to discover many symbolic expressions of functions that fit the given data set. In more detail, the task of regression is to identify the variables (inputs) in the data that are related to the changes in the important control variables (outputs), to express these relationships in mathematical models, and to analyze the quality and generality of the constructed models.

If a researcher possesses domain knowledge and intuition for the appropriate input variables and for the appropriate form of the functional relationship between the inputs and the output, then classical regression techniques will efficiently optimize the parameters of the assumed model (e.g., by using the ordinary or generalized least squares method for the given model structure). When domain knowledge about the data-generating system is limited, classical regression still leaves it to the researcher to prune the data variables to an uncorrelated subset and to guess the right model form. Symbolic regression, as opposed to classical regression techniques, does this for the researcher: it discovers both the form of the model and its parameters.
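To make the classical case concrete, here is a minimal sketch of ordinary least squares when the model structure is already assumed. The model form y = a * x**2 + b, the function name `fit_assumed_form`, and the synthetic data are all assumptions chosen for illustration; only the two parameters are estimated, while the structure is fixed by the researcher.

```python
def fit_assumed_form(xs, ys):
    """Ordinary least squares for the assumed structure y = a * x**2 + b."""
    # The assumed functional form fixes the feature; only a and b are free.
    zs = [x ** 2 for x in xs]
    n = len(xs)
    z_mean = sum(zs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares solution for a line in the feature z.
    cov = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
    var = sum((z - z_mean) ** 2 for z in zs)
    a = cov / var
    b = y_mean - a * z_mean
    return a, b


# Synthetic, noise-free data generated from y = 3 * x**2 + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x ** 2 + 1.0 for x in xs]

a, b = fit_assumed_form(xs, ys)
print(a, b)
```

On this data the fit recovers a close to 3 and b close to 1, up to floating-point rounding. If the true relationship had a different structure, the same procedure would still return "optimal" parameters for the wrong form, which is the gap symbolic regression is meant to close.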

Symbolic regression builds models in three steps. First, the researcher selects a set of primitive functional operators allowed in the mathematical models. Second, an evolutionary algorithm evolves both model structures and model parameters. Third, the modeling results are scrutinized to identify the driving input variables and to select the final ensemble of models.
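The three steps can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not the algorithm of any particular tool: the primitive set, the tuple-based expression encoding, the mutation-only search with truncation selection, and every name below (`random_tree`, `evolve`, `variables_used`, and so on) are hypothetical choices made for brevity. Practical systems typically add crossover, complexity control, and more careful model selection.

```python
import operator
import random

# Step 1: primitives allowed in the models (chosen by the researcher; assumed here).
PRIMITIVES = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
VARIABLES = ["x0", "x1", "x2"]


def random_tree(rng, depth=3):
    """Grow a random expression: a variable, a constant, or a primitive on two subtrees."""
    if depth <= 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return rng.choice(VARIABLES)
        return round(rng.uniform(-2.0, 2.0), 2)
    name = rng.choice(list(PRIMITIVES))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, row):
    """Evaluate an expression tree on one data row (a dict of variable values)."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return PRIMITIVES[name](evaluate(left, row), evaluate(right, row))
    return row[tree] if isinstance(tree, str) else tree


def error(tree, data):
    """Mean squared error of a model over (row, target) pairs."""
    return sum((evaluate(tree, row) - y) ** 2 for row, y in data) / len(data)


def mutate(rng, tree, depth=3):
    """Jitter a constant (parameter change) or replace a subtree (structure change)."""
    if isinstance(tree, float) and rng.random() < 0.5:
        return tree + rng.gauss(0.0, 0.5)
    if not isinstance(tree, tuple) or rng.random() < 0.2:
        return random_tree(rng, depth)
    name, left, right = tree
    if rng.random() < 0.5:
        return (name, mutate(rng, left, depth - 1), right)
    return (name, left, mutate(rng, right, depth - 1))


def evolve(data, generations=40, pop_size=200, seed=0):
    """Step 2: evolve structures and parameters together; keep the best quarter each round."""
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 4]
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    population.sort(key=lambda t: error(t, data))
    return population[:5], history  # a small (possibly redundant) ensemble


def variables_used(tree):
    """Step 3 helper: which input variables appear in a model."""
    if isinstance(tree, tuple):
        return variables_used(tree[1]) | variables_used(tree[2])
    return {tree} if isinstance(tree, str) else set()


# Synthetic data: the target depends on x0 and x1 only; x2 is a distractor.
_rng = random.Random(42)
ROWS = [{v: _rng.uniform(-2.0, 2.0) for v in VARIABLES} for _ in range(30)]
DATA = [(row, row["x0"] * row["x1"] + row["x0"]) for row in ROWS]

models, history = evolve(DATA)
for model in models:
    print(error(model, DATA), sorted(variables_used(model)), model)
```

Because the survivors are carried over unchanged, the best error recorded in `history` never increases from one generation to the next. Counting how often each variable appears across the returned models is a crude stand-in for identifying the driving inputs; whether the distractor `x2` is actually dropped depends on the run.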

For more information, check Frequently Asked Questions about Symbolic Regression.