Database reverse engineering
Do you need to import CSV files into a database but no one gave you the entity–relationship model to set up the primary and foreign keys?
Say no more.
The algorithm works on metadata about tables and columns, which is accessible over JDBC, and on column and table statistics, which the database maintains for execution-plan optimization. That makes the algorithm blazing fast, as it never looks at the actual data.
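Linkifier reads this metadata over JDBC. As a language-agnostic illustration of the idea — inspecting a table's schema without scanning any rows — here is a minimal sketch using Python's built-in `sqlite3` module (SQLite and the `molecule` table are assumptions for the example only; Linkifier itself is a Java tool and is not tied to SQLite):

```python
import sqlite3

# An in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule (molecule_id INTEGER, name TEXT)")

# Column metadata (position, name, declared type, ...) comes from the
# catalog, so no row data is ever read.
columns = conn.execute("PRAGMA table_info(molecule)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(cid, name, col_type)
```

Over JDBC, the equivalent call is `connection.getMetaData().getColumns(...)`, which works uniformly across databases.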
Primary keys are identified by features computed from the collected metadata and statistics.
Once these features are collected, they are passed to logistic regression to estimate the probability that the column belongs to a primary key (compound PKs are supported). Since each table can have at most one PK, the column with the highest probability in the table is declared the table's primary key. If the estimated count of unique values in the estimated PK is smaller than the row count of the table, that is a sign of a compound PK. In that case, columns are incrementally added to the PK in descending order of their probability of being in the PK, until the estimated PK unique row count ≥ estimated table row count. The estimated PK unique row count is optimistic and is defined as the product of the individual columns' unique value counts. For example, if a table has 100 rows and the best PK candidate has only 11 unique values, we have to create a compound PK. The second best candidate has 13 unique values. Since 11 × 13 = 143 ≥ 100, we stop here and return a compound PK with these 2 columns.
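The greedy expansion described above can be sketched as follows (a minimal illustration, not Linkifier's actual code; the column names are made up):

```python
def expand_pk(candidates, table_row_count):
    """Greedily add columns, sorted by descending PK probability, until the
    optimistic unique-count estimate (the product of the per-column unique
    value counts) reaches the table's row count."""
    pk, estimate = [], 1
    for column, unique_count in candidates:
        pk.append(column)
        estimate *= unique_count
        if estimate >= table_row_count:
            break
    return pk

# The example from the text: 11 * 13 = 143 >= 100, so two columns suffice.
print(expand_pk([("molecule_id", 11), ("atom_id", 13), ("charge", 7)], 100))
# → ['molecule_id', 'atom_id']
```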
Foreign keys are identified by features computed from the same metadata and statistics.
Once again, the probabilities are estimated with logistic regression, and the most likely FK constraints are returned (a compound FK is returned when the referenced PK is compound).
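How logistic regression turns per-candidate features into a probability can be sketched in a few lines (the feature set and the weights below are made up for illustration; Linkifier learns its own weights from labeled schemas):

```python
import math

def fk_probability(features, weights, bias):
    """Toy logistic-regression score for one (column, PK) candidate pair."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: name similarity to the PK column, data-type
# match (0/1), and value-range overlap estimated from column statistics.
candidate = [0.9, 1.0, 0.8]
p = fk_probability(candidate, weights=[2.0, 1.5, 1.0], bias=-2.5)
```

The candidate pairs are then ranked by this probability and the most likely ones are emitted as FK constraints.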
Because the algorithm relies on up-to-date table and column statistics, refresh them if they are stale or missing. Use the command for your database (shown here for a `mutagenesis.molecule` table; the database labels are my best reading of the dialects):

```sql
-- SQL Server
CREATE STATISTICS mutagenesis.molecule
-- MySQL
ANALYZE TABLE mutagenesis.molecule
-- Oracle
EXEC DBMS_STATS.gather_table_stats('mutagenesis', 'molecule')
-- PostgreSQL
ANALYZE mutagenesis.molecule
```
If the schema quality is extremely low (e.g. all columns are typed as text and have irrelevant names), the PK and particularly the FK estimates are going to be off. First, set correct data types and meaningful names on the columns. Then rerun Linkifier.
If you are using MySQL and get `Access to data dictionary table 'mysql.table_stats' is rejected`, it is because MySQL, contrary to MariaDB, prevents access to internal tables. To be able to run Linkifier, start the database in debug mode.
If you have problems connecting to MSSQL, make sure that login with a username–password combination is permitted and that the TCP/IP protocol is enabled. If you want to use Windows Authentication instead of a username–password combination, a correct version of `sqljdbc_auth.dll` (32/64-bit) must be on the Java path. Possibly the easiest solution is to copy `sqljdbc_auth.dll` from the source directory `lib\mssql\auth\` into the `bin` directory of your Java installation (e.g. `C:\Program Files\Java\jre8\bin\`). Alternatively, start Linkifier from the command line with arguments like `java -Djava.library.path=C:\path\to\sqljdbc_auth_directory -jar linkifier.jar`.
If you get an error from a database driver, the driver might be outdated. Replace it with a new one that is compatible with JDK 1.8 and your database version.
If you get `Table count: 0`, make sure that both the database name and the schema name are explicitly specified (in MySQL, you specify only the database name).
If you get:

```
Error: Could not find or load main class controller.MainApp
Caused by: java.lang.NoClassDefFoundError: javafx/application/Application
```

on macOS with an M1 processor or newer, you actually have to use OpenJDK together with the JavaFX SDK:

```
./Downloads/jdk-20.jdk/Contents/Home/bin/java --module-path ./Downloads/javafx-sdk-20/lib --add-modules javafx.controls,javafx.fxml -jar ./Downloads/linkifier/linkifier-3.2.9.jar
```
If the GUI still does not work, run Linkifier in command-line mode.
If you have any questions or suggestions, let me know.
Jan Motl and Pavel Kordík: Foreign Key Constraint Identification in Relational Databases. ITAT 2017 Proceedings, 2017.
I would like to thank Aleš Fišer, Oliver Kerul’-Kmec, Jan Kukačka, Jiří Kukačka, Manuel Muñoz, George Papaloukopoulos and Batal Thibaut for their help. The code uses Simmetrics for text-similarity calculations.