blink - Record Linkage for Empirically Motivated Priors
An implementation of the model in Steorts (2015) <DOI:10.1214/15-BA965SI>, which performs Bayesian entity resolution for categorical and text data using any distance function defined by the user. The package also includes precision and recall, so that results can be compared with other methods such as logistic regression, Bayesian additive regression trees (BART), or random forests. The experiments are reproducible and illustrated in a simple vignette. LICENSE: GPL-3 + file license.
Last updated 1 year ago
5.72 score 5 stars 1 dependent 70 scripts 288 downloads
cd - CD Data for Entity Resolution
Duplicated music data (pre-processed and formatted) for entity resolution. The data set contains 9763 records. The accompanying labeled gold standard records can be treated as a unique identifier.
Last updated 7 years ago
4.16 score 29 scripts 222 downloads
klsh - Blocking for Record Linkage
An implementation of the blocking algorithm KLSH in Steorts, Ventura, Sadinle, Fienberg (2014) <DOI:10.1007/978-3-319-11257-2_20>, which is a k-means variant of locality sensitive hashing. The method is illustrated with examples and a vignette.
Last updated 4 years ago
3.70 score 3 scripts 176 downloads
cora - Cora Data for Entity Resolution
Duplicated publication data (pre-processed and formatted) for entity resolution. The data set contains 1879 records with the following variables: id, title, book title, authors, address, date, year, editor, journal, volume, pages, publisher, institution, type, tech, note. A corresponding gold data set indicates which records match based on id.
Last updated 5 years ago
3.35 score 3 stars 15 scripts 173 downloads
restaurant - Restaurant Data for Entity Resolution
Duplicated restaurant data (pre-processed and formatted) for entity resolution. This package contains formatted data about restaurants, with the Zagats portion containing 331 records and the Fodors portion containing 533 records. The following variables are included in the data set: id, name, address, city, phone, type. A corresponding gold data set indicates which records match based on id.
Last updated 7 years ago
2.00 score 1 star 162 downloads