-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
156 lines (79 loc) · 3.15 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
---
output:
md_document:
variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README_plots/"
)
```
# mixdir
The goal of mixdir is to cluster high dimensional categorical datasets.
It can
* handle missing data
* infer a reasonable number of latent class (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering
A detailed description of the algorithm and the features of the package can
be found in the the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).
If you find the package useful please cite
>C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data",
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.
## Installation
```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")
# Or to get the latest version from github
devtools::install_github("const-ae/mixdir")
```
## Example
Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.
![](man/figures/README_plots/clustering_overview.png)
```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)
data("mushroom")
# High dimensional dataset: 8124 mushroom and 23 different features
mushroom[1:10, 1:5]
```
Calling the clustering function `mixdir` on a subset of the data:
```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)
```
Analyzing the result
```{r example}
# Latent class of of first 10 mushrooms
head(result$pred_class, n=10)
# Soft Clustering for first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
labels_col = paste("Class", 1:3))
# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)
# The most predicitive features for each class
find_predictive_features(result, top_n=3)
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))
# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```
Dimensionality Reduction
```{r fig.width=8, fig.asp=0.31}
# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
print(def_feat)
# Plotting the most important features gives an immediate impression
# how the cluster differ
plot_features(def_feat$features, result$category_prob)
```
# Underlying Model
The package implements a variational inference algorithm to solve a Bayesian latent class model (LCM).
<div class="figure"><img src="man/figures/README_plots/equations_model.png" align="center" style="height: 150px" ></div>
![](man/figures/README_plots/model_plate_notation.png)