Cluster wrapper function
cluster(data, ..., n_clusters, minimum_term_frequency = 3, min_terms = 3, num_terms = 10, stopwords = NULL, remove_twitter = FALSE)
Argument | Description |
---|---|
data | A data frame containing the text vector as its first column |
... | Additional columns of the data frame containing metadata for comparison |
n_clusters | The number of clusters to be used for the clustering solution |
minimum_term_frequency | The minimum number of occurrences for a term to be included |
min_terms | The minimum number of terms for a document to be included |
num_terms | Number of terms to display in clustering summary output |
stopwords | Additional stopwords to exclude from clustering analysis |
remove_twitter | Whether to remove text associated with Twitter content; useful when analyzing data from Twitter (defaults to FALSE) |
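
To make the two filtering thresholds concrete, here is a minimal base R sketch on a made-up document-term matrix. The helper `filter_dtm` is hypothetical and is not part of clustRcompaR. It assumes that `minimum_term_frequency` is compared against a term's total count across all documents, that `min_terms` is compared against the number of term occurrences left in a document, and that terms are filtered before documents; the package may differ on any of these points.

```r
# Toy document-term count matrix: rows are documents, columns are terms.
# All counts are invented for illustration.
dtm <- matrix(
  c(2, 1, 0, 1,
    1, 1, 1, 0,
    0, 0, 0, 1,
    3, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("nation", "people", "world", "rare"))
)

# Hypothetical helper (not a clustRcompaR function).
filter_dtm <- function(dtm, minimum_term_frequency = 3, min_terms = 3) {
  # Assumption: a term is kept if its total count across documents
  # reaches minimum_term_frequency.
  keep_terms <- colSums(dtm) >= minimum_term_frequency
  dtm <- dtm[, keep_terms, drop = FALSE]
  # Assumption: a document is kept if it still holds at least
  # min_terms term occurrences after the term filter.
  keep_docs <- rowSums(dtm) >= min_terms
  dtm[keep_docs, , drop = FALSE]
}

filtered <- filter_dtm(dtm, minimum_term_frequency = 3, min_terms = 3)
print(filtered)
```

Under these assumptions, raising either threshold shrinks the matrix that is passed on to clustering, which is why short documents or rare terms can disappear from the solution.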
Performs the clustering half of the process, including assembling and cleaning the corpus, deviationalizing and clustering.
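
The steps named above can be sketched in base R on a toy corpus. This is an illustration under stated assumptions, not the package's implementation: "deviationalizing" is taken here to mean scaling each document vector to unit length and subtracting the corpus-wide average profile, and the clustering step is assumed to be Ward hierarchical clustering refined by k-means. The documents, the stopword list, and all variable names are made up.

```r
# Toy corpus (invented for illustration).
docs <- c(
  "the nation and the people",
  "people of the nation",
  "world peace and world trade",
  "trade across the world"
)

# Clean: lower-case, split on non-letters, drop a tiny illustrative stopword list.
stop_list <- c("the", "and", "of", "across")
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(x) {
  x[nzchar(x) & !(x %in% stop_list)]
})

# Assemble a document-term count matrix (documents in rows).
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))

# "Deviationalize" (assumed meaning): unit-length rows minus the average profile.
unit <- dtm / sqrt(rowSums(dtm^2))
dev <- sweep(unit, 2, colMeans(unit))

# Cluster (assumed method): Ward hierarchical clustering to get starting
# groups, then k-means started from those group centers.
n_clusters <- 2
hc <- hclust(dist(dev), method = "ward.D2")
init <- cutree(hc, k = n_clusters)
centers <- rowsum(dev, init) / as.vector(table(init))
km <- kmeans(dev, centers = centers)
print(km$cluster)
```

Seeding k-means from the hierarchical solution makes the result deterministic for a given input, which is convenient when comparing cluster membership across metadata groups.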
library(clustRcompaR)
library(dplyr)
library(quanteda)

d <- inaugural_addresses
# Label each address with the century its year falls in
d <- mutate(d, century = ifelse(Year < 1800, "18th",
                         ifelse(Year >= 1800 & Year < 1900, "19th",
                         ifelse(Year >= 1900 & Year < 2000, "20th", "21st"))))

three_clusters <- cluster(d, century, n_clusters = 3)
#> Document-feature matrix of: 58 documents, 2,820 features (79.6% sparse).
extract_terms(three_clusters)
#>    Cluster.1.Terms Cluster.1.Term.Frequencies Cluster.2.Terms
#> 1               in                  34.200000              in
#> 2               my                  13.866667           their
#> 3            their                  12.333333          govern
#> 4             will                  11.200000            will
#> 5           govern                   9.533333             has
#> 6            peopl                   7.200000              it
#> 7               it                   7.133333           state
#> 8           nation                   7.000000            been
#> 9              has                   6.733333           peopl
#> 10         countri                   6.533333          nation
#>    Cluster.2.Term.Frequencies Cluster.3.Terms Cluster.3.Term.Frequencies
#> 1                    77.52941              in                  36.692308
#> 2                    22.88235            will                  16.076923
#> 3                    21.41176          nation                  12.500000
#> 4                    20.29412              us                  12.038462
#> 5                    20.00000           world                   9.807692
#> 6                    19.41176           peopl                   9.307692
#> 7                    18.23529             can                   7.769231
#> 8                    17.82353            must                   7.730769
#> 9                    16.05882         america                   7.423077
#> 10                   14.41176              no                   7.192308
three_clusters_comparison <- compare(three_clusters, "century")
compare_plot(three_clusters_comparison)