filling package

filling.data_generator class

class filling.data_generator.DataGenerator(tables, num_rows=10, predefined_values=None, column_type_mappings=None, num_rows_per_table=None, max_attepts_to_generate_value=50, guess_column_type_mappings=False, threshold_for_guessing=0.8)

Bases: object

Intelligent Data Generator for Automated Synthetic Database Population.

assign_foreign_keys(table, row)

Auto-assign foreign key values from the parent table data.

build_foreign_key_map()

Construct a mapping of foreign key relationships between parent and child tables.

Return type:

dict

compute_table_levels()

Return a dict like {level_number: [tableA, tableB, …], …} that groups tables by ‘level’ in the foreign-key dependency graph.

Return type:

dict

enforce_check_constraints(table, row)

Repeatedly generate new candidate values until all CHECK constraints pass.

enforce_constraints()

Enforce NOT NULL, CHECK, and UNIQUE constraints across all tables, in parallel by table level.

enforce_not_null_constraints(table, row)

Ensure columns with ‘NOT NULL’ are populated.

enforce_unique_constraints(table, row)

Ensure row satisfies any UNIQUE constraints for the given table.

export_as_sql_insert_query(max_rows_per_insert=1000)

Export generated data as SQL INSERT queries.

Splits the rows into chunks (up to max_rows_per_insert per query) to avoid exceeding database limits on single inserts.

Parameters:

max_rows_per_insert (int, optional) – Maximum number of rows per INSERT statement (default is 1000).

Returns:

A string containing SQL INSERT statements for all populated tables.

Return type:

str

export_data_files(output_dir, file_type='SQL')

Export generated data to files in the specified format.

Exports data for each table as CSV or JSON files, or as a single SQL file containing INSERT statements. The export is performed sequentially.

Parameters:
  • output_dir (str) – Directory where the exported files will be saved.

  • file_type (str, optional) – The format to export data (‘SQL’, ‘CSV’, or ‘JSON’). Default is ‘SQL’.

Return type:

None

fill_remaining_columns(table, row)

For each column not yet assigned, generate a synthetic value.

generate_column_value(table, column, row, constraints=None)

Generate a synthetic value for ‘column’ in ‘table’ subject to constraints.

generate_composite_primary_keys(table, row_count)

Generate composite PK values by combining columns in a Cartesian product.

generate_data()

Generate synthetic data for all tables.

This is the main entry point for data generation. It first generates initial data for all tables, then enforces constraints (NOT NULL, CHECK, UNIQUE), and optionally runs a repair process to remove rows violating constraints. Finally, it prints statistics if requested.

Parameters:
  • run_repair (bool, optional) – If True, attempt to repair generated data to remove constraint violations.

  • print_stats (bool, optional) – If True, print data generation statistics.

Returns:

A dictionary mapping table names to lists of generated row dictionaries.

Return type:

dict

generate_initial_data()

Generate initial data in parallel groups by table level.

generate_primary_keys(table, row_count)

Generate single-column primary keys for ‘row_count’ rows.

generate_value_based_on_conditions(row, column, conditions)

Generate a candidate that satisfies the given conditions for this column.

generate_value_based_on_type(col_type)

Fallback generator for a column based on general type inference.

get_all_column_names()

Retrieve a comprehensive list of all column names across all tables.

Return type:

list

get_column_info(table, col_name)

Retrieve a column’s info (cached).

Return type:

dict

initialize_primary_keys()

Initialize primary key counters for each table to ensure unique ID generation.

is_foreign_key_column(table, col_name)

Check if a column is part of a foreign key definition in ‘table’.

Return type:

bool

preview_inferred_mappings(num_preview=10)

Print a preview of the inferred column mappings.

Generates a small sample (num_preview rows) of data for each table using the guessed column type mappings. This preview helps to visually inspect whether the mappings produce appropriate values.

Parameters:

num_preview (int, optional) – Number of sample rows to generate per table (default is 10).

Return type:

None

process_row(table, row)

Fill out a row: assign FKs, fill columns, ensure NOT NULL, check constraints.

Return type:

dict

resolve_table_order()

Determine the order for processing tables based on foreign key dependencies. Ensures parent tables come before child tables.

Return type:

list

filling.check_constraint_evaluator module

class filling.check_constraint_evaluator.CheckConstraintEvaluator(schema_columns=None)

SQL CHECK Constraint Evaluator for Data Validation.

This class parses and evaluates SQL CHECK constraints against row data. It supports functions like EXTRACT and DATE, various operators (including BETWEEN), and provides helper functions for operand unification.

apply_operator(left, operator, right)

Apply a binary operator to left and right operands.

Supports standard arithmetic, comparison, logical, and SQL-specific operators (LIKE, IN, BETWEEN, etc.). If the operator condition is not met, a candidate value is proposed for the left operand.

Parameters:
  • left (any) – The left operand.

  • operator (str) – The operator as a string.

  • right (any) – The right operand.

Returns:

(result, candidate) where result is a boolean indicating whether the condition is satisfied, and candidate is a proposed value for the left operand if not.

Return type:

tuple

convert_sql_expr_to_python(parsed_expr, row)

Convert a parsed SQL expression into a Python expression string.

Recursively converts literals, identifiers, and function calls from the parsed SQL expression into a corresponding Python expression string for evaluation.

Parameters:
  • parsed_expr (any) – The parsed SQL expression.

  • row (dict) – The row of data used for context during conversion.

Returns:

The resulting Python expression string.

Return type:

str

date_func(arg)

Simulate the SQL DATE function.

Converts the input argument to a date object. If the argument is a string, it is parsed using the ‘%Y-%m-%d’ format.

Parameters:

arg (any) – The value to convert to a date.

Returns:

The resulting date object.

Return type:

date

evaluate(check_expression, row)

Evaluate a SQL CHECK constraint against a row of data.

Parses the given CHECK constraint and evaluates it against the provided row. Returns a tuple (result, candidate) where result is a boolean indicating whether the constraint is satisfied and candidate is a proposed value if the constraint fails.

Parameters:
  • check_expression (str) – The SQL CHECK constraint expression.

  • row (dict) – A dictionary representing a row of data.

Returns:

(result, candidate) where result is a boolean and candidate is a proposed adjustment if needed.

Return type:

tuple

extract(field, source)

Simulate the SQL EXTRACT function.

Extracts a specific component (year, month, or day) from a given date or datetime source.

Parameters:
  • field (str) – The field to extract (e.g., ‘year’, ‘month’, ‘day’).

  • source (any) – The source date or datetime (or string representing a date).

Returns:

The extracted numeric value.

Return type:

int

extract_conditions(check_expression)

Extract conditions from a SQL CHECK constraint.

Parses the given CHECK constraint expression and returns a dictionary mapping column names to lists of condition dictionaries (each containing an operator and a value).

Parameters:

check_expression (str) – The SQL CHECK constraint expression.

Returns:

A dictionary mapping column names to lists of condition dictionaries.

Return type:

dict

handle_operator(parsed_expr, row)

Convert an operator expression into a Python expression string.

Maps SQL operators (e.g., ‘=’, ‘<>’, ‘LIKE’, ‘IN’) to their Python equivalents and constructs a corresponding expression string.

Parameters:
  • parsed_expr (ParseResults) – The parsed operator expression.

  • row (dict) – The row of data used for context.

Returns:

A Python expression string representing the operator expression.

Return type:

str

like(value, pattern)

Simulate the SQL LIKE operator.

Converts the SQL LIKE pattern (using ‘%’ and ‘_’) into a regular expression and checks whether the given value matches the pattern.

Parameters:
  • value (str) – The value to test.

  • pattern (str) – The SQL LIKE pattern.

Returns:

True if the value matches the pattern, otherwise False.

Return type:

bool

not_like(value, pattern)

Simulate the SQL NOT LIKE operator.

Returns the negation of the LIKE operator evaluation for the given value and pattern.

Parameters:
  • value (str) – The value to test.

  • pattern (str) – The SQL LIKE pattern.

Returns:

True if the value does not match the pattern, otherwise False.

Return type:

bool

regexp_like(value, pattern)

Simulate the SQL REGEXP_LIKE function.

Checks whether the given value matches the specified regular expression pattern. If the value does not match, returns a candidate value that may satisfy the pattern.

Parameters:
  • value (str) – The value to test.

  • pattern (str) – The regular expression pattern (optionally quoted).

Returns:

(result, candidate) where result is a boolean indicating a match and candidate is a proposed value if no match is found.

Return type:

tuple

unify_operands(left, right)

Coerce both operands to a common type if possible.

Attempts to convert both operands to date or numeric types. If both can be converted, returns the converted values; otherwise, returns the original operands.

Parameters:
  • left (any) – The left operand.

  • right (any) – The right operand.

Returns:

A tuple (left, right) with operands coerced to common types if possible.

Return type:

tuple

filling.helpers module

filling.helpers.extract_allowed_values(constraints, col_name)

Extract allowed values from constraints related to a specific column.

This function parses a list of SQL constraint expressions to identify any IN clauses that define a set of permissible values for a specified column. It extracts and returns these allowed values.

Parameters:
  • constraints (list) – A list of SQL constraint expressions as strings.

  • col_name (str) – The name of the column from which to extract allowed values.

Returns:

A list of allowed values specified in the IN clauses for the given column. For example: [‘Fiction’, ‘Non-fiction’, ‘Science’]

Return type:

list of str

filling.helpers.extract_numeric_ranges(constraints, col_name)

Extract numeric ranges from constraints related to a specific column.

This function parses a list of SQL constraint expressions to identify and extract numeric range conditions (e.g., >=, <=, >, <, =, BETWEEN) applied to a specified column. It returns these conditions as a list of tuples, where each tuple contains the operator and the corresponding numeric value.

Parameters:
  • constraints (list) – A list of SQL constraint expressions as strings.

  • col_name (str) – The name of the column from which to extract numeric range constraints.

Returns:

A list where each tuple consists of an operator (str) and a numeric value (float). For example: [(‘>=’, 1.0), (‘<=’, 10.0)]

Return type:

list of tuple

filling.helpers.extract_regex_pattern(constraints, col_name)

Extract regex patterns from constraints related to a specific column.

This function scans through a list of SQL constraint expressions to identify any REGEXP_LIKE conditions applied to a specified column. It extracts and returns the regex patterns used in these constraints.

Parameters:
  • constraints (list) – A list of SQL constraint expressions as strings.

  • col_name (str) – The name of the column from which to extract regex patterns.

Returns:

A list of regex patterns found in the constraints for the specified column. For example: [‘^d{13}$’, ‘^[w.-]+@[w.-]+.w{2,}$’]

Return type:

list of str

filling.helpers.generate_numeric_value(ranges, col_type)

Generate a numeric value based on specified ranges and column type.

This function takes a list of numeric range conditions and the data type of the column to generate a random numeric value that satisfies all specified constraints. It intelligently determines the appropriate range for value generation based on the operators provided.

Parameters:
  • ranges (list of tuple) – A list of tuples where each tuple contains an operator (str) and a numeric value (float). Example: [(‘>=’, 1.0), (‘<=’, 10.0)]

  • col_type (str) – The SQL data type of the column (e.g., ‘INT’, ‘DECIMAL’, ‘NUMERIC’).

Returns:

A randomly generated numeric value within the specified range. The type of the returned value matches the column type: - Returns an int if the column type is integer-based. - Returns a float if the column type is decimal-based.

Return type:

int or float

filling.helpers.generate_value_matching_regex(pattern)

Generate a value that matches a specified regex pattern.

This function utilizes the exrex library to generate a random string that conforms to the provided regular expression pattern. It handles escape sequences and ensures that the generated value is valid according to the regex constraints.

Parameters:

pattern (str) – The regex pattern that the generated string must match. For example, ‘^d{13}$’ for a 13-digit ISBN.

Returns:

A randomly generated string that matches the given regex pattern. If the pattern is invalid or no matching string can be generated, an empty string is returned.

Return type:

str