3  Quantitative analysis of textual data

3.1 Introduction

There are several R packages for quantitative text analysis, but we will focus specifically on the quanteda package. First, install the package from CRAN:

install.packages("quanteda")

Since the release of quanteda version 3.0, the textstat_*, textmodel_*, and textplot_* functions are available in separate packages. We will use several of these functions in the chapters below and strongly recommend installing these packages.

install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")

We will use the readtext package to read in different types of text data in these tutorials.

install.packages("readtext")

3.2 Quantitative data

Before beginning, we need to load the libraries.
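
The exact set of packages below is inferred from the functions used in this chapter:

library(quanteda)
library(quanteda.textstats)   # textstat_frequency(), textstat_lexdiv(), textstat_dist()
library(quanteda.textplots)   # textplot_wordcloud(), textplot_network(), textplot_scale1d()
library(readtext)
library(tidyverse)            # read_csv(), glimpse(), filter(), ggplot()
library(lubridate)            # year()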

And these later for the modelling section:
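
# inferred from the modelling functions used later in this chapter
library(quanteda.textmodels)  # textmodel_wordfish()
library(seededlda)            # textmodel_lda()
library(LSX)                  # textmodel_lss(), smooth_lss(), textplot_terms()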

3.2.1 Pre-formatted files

If your text data is stored in a pre-formatted file where one column contains the text and additional columns might store document-level variables (e.g. year, author, or language), you can import this into R using read_csv().

path_data <- system.file("extdata/", package = "readtext")
dat_inaug <- read_csv(paste0(path_data, "/csv/inaugCorpus.csv"))
Rows: 5 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): texts, President, FirstName
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(dat_inaug)
Rows: 5
Columns: 4
$ texts     <chr> "Fellow-Citizens of the Senate and of the House of Represent…
$ Year      <dbl> 1789, 1793, 1797, 1801, 1805
$ President <chr> "Washington", "Washington", "Adams", "Jefferson", "Jefferson"
$ FirstName <chr> "George", "George", "John", "Thomas", "Thomas"

The data set contains the inaugural speeches of US presidents. As we can see, it is arranged in tabular form, with 5 rows and 4 columns: texts, Year, President, and FirstName.

Alternatively, you can use the readtext package to import character-delimited (comma- or tab-separated) values. readtext reads files containing text, along with any associated document-level variables. As an example, consider the following tsv file:

tsv_file <- paste0(path_data, "/tsv/dailsample.tsv")
cat(readLines(tsv_file, n = 4), sep = "\n")  # header plus the first 3 rows
speechID    memberID    partyID constID title   date    member_name party_name  const_name  speech
1   977 22  158 1. CEANN COMHAIRLE I gCOIR AN LAE.  1919-01-21  Count George Noble, Count Plunkett  Sinn Féin   Roscommon North Molaimse don Dáil Cathal Brugha, an Teachta ó Dhéisibh Phortláirge do bheith mar Cheann Comhairle againn indiu.
2   1603    22  103 1. CEANN COMHAIRLE I gCOIR AN LAE.  1919-01-21  Mr. Pádraic Ó Máille    Sinn Féin   Galway Connemara    Is bród mór damhsa cur leis an dtairgsin sin. Air sin do ghaibh CATHAL BRUGHA ceannus na Dála agus adubhairt:-
3   116 22  178 1. CEANN COMHAIRLE I gCOIR AN LAE.  1919-01-21  Mr. Cathal Brugha   Sinn Féin   Waterford County    ' A cháirde, tá obair thábhachtach le déanamh annso indiu, an obair is tábhachtaighe do rinneadh in Éirinn ón lá tháinic na Gaedhil go hÉirinn, agus is naomhtha an obair í. Daoine go bhfuil dóchas aca as Dia iseadh sinn go léir, daoine a chuireann suim I ndlighthibh Dé, agus dá bhrigh sin budh chóir dúinn congnamh d'iarraidh ar Dhia I gcóir na hoibre atá againn le déanamh. Iarrfad anois ar an sagart is dúthrachtaighe dár mhair riamh I nÉirinn, an tAthair Micheál Ó Flannagáin, guidhe chum an Spiorad Naomh dúinn chum sinn do stiúradh ar ár leas ar an mbóthar atá againn le gabháil. ' ' Agus, a cháirde, pé cineál creidimh atá ag éinne annso, iarrfad ar gach n-aon paidir do chur suas chum Dé, ó íochtar a chroidhe chum cabhair do thabhairt dúinn indiu. Glaodhaim anois ar an Athair Micheál Ó Flannagáin. ' Do tháinig an tATHAIR MICHEÁL Ó FLANNAGÁIN I láthair na Dála agus do léigh an phaidir seo I n-ár ndiaidh: 'Tair, A Spioraid Naomh, ath-líon croidhthe t'fhíoraon, agus adhain ionnta teine do ghrádha. ' ' Cuir chugainn do Spioraid agus cruthóchfar iad agus athnuadhfaidh Tú aghaidh na talmhan.' ' Guidhmís: ' ' A Dhia, do theagaisc croidhthe sa bhfíoraon le lonnradh an Spioraid Naoimh, tabhair dúinn, san Spiorad cheudna, go mblaisfimíd an ceart agus go mbéidh síorgháirdeachais orainn de bhárr a shóláis sin. Tré Íosa Críost ár dTighearna. Ámén'

The raw document is arranged in tabular form, with values separated by tabs. Each row contains a “document” (in this case, a speech) and the remaining columns contain document-level variables. The column containing the actual speech text is named speech. To import this using readtext, specify it via the text_field argument:

dat_dail <- readtext(tsv_file, text_field = "speech")
glimpse(dat_dail)
Rows: 33
Columns: 11
$ doc_id      <chr> "dailsample.tsv.1", "dailsample.tsv.2", "dailsample.tsv.3"…
$ text        <chr> "Molaimse don Dáil Cathal Brugha, an Teachta ó Dhéisibh Ph…
$ speechID    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ memberID    <int> 977, 1603, 116, 116, 116, 116, 496, 116, 116, 2095, 116, 1…
$ partyID     <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22…
$ constID     <int> 158, 103, 178, 178, 178, 178, 46, 178, 178, 139, 178, 178,…
$ title       <chr> "1. CEANN COMHAIRLE I gCOIR AN LAE.", "1. CEANN COMHAIRLE …
$ date        <chr> "1919-01-21", "1919-01-21", "1919-01-21", "1919-01-21", "1…
$ member_name <chr> "Count George Noble, Count Plunkett", "Mr. Pádraic Ó Máill…
$ party_name  <chr> "Sinn Féin", "Sinn Féin", "Sinn Féin", "Sinn Féin", "Sinn …
$ const_name  <chr> "Roscommon North", "Galway Connemara", "Waterford County",…

3.2.2 Multiple text files

A second option to import data is to load multiple text files at once that are stored in the same folder or subfolders. Again, path_data is the location of sample files on your computer. Unlike the pre-formatted files, individual text files usually do not contain document-level variables. However, you can create document-level variables using the readtext package.

The directory /txt/UDHR contains text files (“.txt”) of the Universal Declaration of Human Rights in 13 languages.

path_udhr <- paste0(path_data, "/txt/UDHR")
list.files(path_udhr)  # list the files in this folder
 [1] "UDHR_chinese.txt"    "UDHR_czech.txt"      "UDHR_danish.txt"    
 [4] "UDHR_english.txt"    "UDHR_french.txt"     "UDHR_georgian.txt"  
 [7] "UDHR_greek.txt"      "UDHR_hungarian.txt"  "UDHR_icelandic.txt" 
[10] "UDHR_irish.txt"      "UDHR_japanese.txt"   "UDHR_russian.txt"   
[13] "UDHR_vietnamese.txt"

Each of these txt files contains the text of the UDHR in a specific language. For instance, to inspect what one of these files contains, we can do the following:

# just first 5 lines
cat(readLines(file.path(path_udhr, "UDHR_chinese.txt"), n = 5), sep = "\n")  
世界人权宣言
联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布 1948 年 12 月 10 日, 联 合 国 大 会 通 过 并 颁 布《 世 界 人 权 宣 言》。 这 一 具 有 历 史 意 义 的《 宣 言》 颁 布 后, 大 会 要 求 所 有 会 员 国 广 为 宣 传, 并 且“ 不 分 国 家 或 领 土 的 政 治 地 位 , 主 要 在 各 级 学 校 和 其 他 教 育 机 构 加 以 传 播、 展 示、 阅 读 和 阐 述。” 《 宣 言 》 全 文 如 下: 序言 鉴于对人类家庭所有成员的固有尊严及其平等的和不移的权利的 承 认, 乃 是 世 界 自 由、 正 义 与 和 平 的 基 础, 鉴 于 对 人 权 的 无 视 和 侮 蔑 已 发 展 为 野 蛮 暴 行, 这 些 暴 行 玷 污 了 人 类 的 良 心, 而 一 个 人 人 享 有 言 论 和 信 仰 自 由 并 免 予 恐 惧 和 匮 乏 的 世 界 的 来 临, 已 被 宣 布 为 普 通 人 民 的 最 高 愿 望, 鉴 于 为 使 人 类 不 致 迫 不 得 已 铤 而 走 险 对 暴 政 和 压 迫 进 行 反 叛, 有 必 要 使 人 权 受 法 治 的 保 护, 鉴 于 有 必 要 促 进 各 国 间 友 好 关 系 的 发 展, 鉴于各联合国国家的人民已在联合国宪章中重申他们对基本人 权、 人 格 尊 严 和 价 值 以 及 男 女 平 等 权 利 的 信 念, 并 决 心 促 成 较 大 自 由 中 的 社 会 进 步 和 生 活 水 平 的 改 善, 鉴于各会员国业已誓愿同联合国合作以促进对人权和基本自由的 普 遍 尊 重 和 遵 行, 鉴于对这些权利和自由的普遍了解对于这个誓愿的充分实现具有 很 大 的 重 要 性, 因 此 现 在, 大 会, 发 布 这 一 世 界 人 权 宣 言 , 作 为 所 有 人 民 和 所 有 国 家 努 力 实 现 的 共 同 标 准, 以 期 每 一 个 人 和 社 会 机 构 经 常 铭 念 本 宣 言, 努 力 通 过 教 诲 和 教 育 促 进 对 权 利 和 自 由 的 尊 重, 并 通 过 国 家 的 和 国 际 的 渐 进 措 施, 使 这 些 权 利 和 自 由 在 各 会 员 国 本 身 人 民 及 在 其 管 辖 下 领 土 的 人 民 中 得 到 普 遍 和 有 效 的 承 认 和 遵 行; 第一条

人 人 生 而 自 由, 在 尊 严 和 权 利 上 一 律 平 等。 他 们 赋 有 理 性 和 良 心, 并 应 以 兄 弟 关 系 的 精 神 相 对 待。 第二条 人 人 有 资 格 享 有 本 宣 言 所 载 的 一 切 权 利 和 自 由, 不 分 种 族、 肤 色、 性 别、 语 言、 宗 教、 政 治 或 其 他 见 解、 国 籍 或 社 会 出 身、 财 产、 出 生 或 其 他 身 分 等 任 何 区 别。 并 且 不 得 因 一 人 所 属 的 国 家 或 领 土 的 政 治 的、 行 政 的 或 者 国 际 的 地 位 之 不 同 而 有 所 区 别, 无 论 该 领 土 是 独 立 领 土、 托 管 领 土、 非 自 治 领 土 或 者 处 于 其 他 任 何 主 权 受 限 制 的 情 况 之 下。 第三条 人 人 有 权 享 有 生 命、 自 由 和 人 身 安 全。 第四条 任 何 人 不 得 使 为 奴 隶 或 奴 役; 一 切 形 式 的 奴 隶 制 度 和 奴 隶 买 卖, 均 应 予 以 禁 止。 第五条 任 何 人 不 得 加 以 酷 刑, 或 施 以 残 忍 的、 不 人 道 的 或 侮 辱 性 的 待 遇 或 刑 罚。 第六条 人 人 在 任 何 地 方 有 权 被 承 认 在 法 律 前 的 人 格。 第七条

To import these files, you can use the following code:

dat_udhr <- readtext(path_udhr)
glimpse(dat_udhr)
Rows: 13
Columns: 2
$ doc_id <chr> "UDHR_chinese.txt", "UDHR_czech.txt", "UDHR_danish.txt", "UDHR_…
$ text   <chr> "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议…
Note

If you are using Windows, you might need to specify the encoding of the files by adding encoding = "utf-8". In this case, imported texts might appear like <U+4E16><U+754C><U+4EBA><U+6743>, but this indicates that the Unicode characters have been imported correctly.
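
For example:

dat_udhr <- readtext(path_udhr, encoding = "utf-8")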

Here’s another example of multiple text files. The directory /txt/EU_manifestos contains text files (“.txt”) of the European Union manifestos in different languages.

path_eu <- paste0(path_data, "/txt/EU_manifestos/")
list.files(path_eu)  # list the files in this folder
 [1] "EU_euro_2004_de_PSE.txt" "EU_euro_2004_de_V.txt"  
 [3] "EU_euro_2004_en_PSE.txt" "EU_euro_2004_en_V.txt"  
 [5] "EU_euro_2004_es_PSE.txt" "EU_euro_2004_es_V.txt"  
 [7] "EU_euro_2004_fi_V.txt"   "EU_euro_2004_fr_PSE.txt"
 [9] "EU_euro_2004_fr_V.txt"   "EU_euro_2004_gr_V.txt"  
[11] "EU_euro_2004_hu_V.txt"   "EU_euro_2004_it_PSE.txt"
[13] "EU_euro_2004_lv_V.txt"   "EU_euro_2004_nl_V.txt"  
[15] "EU_euro_2004_pl_V.txt"   "EU_euro_2004_se_V.txt"  
[17] "EU_euro_2004_si_V.txt"  

You can generate document-level variables based on the file names using the docvarnames and docvarsfrom arguments. dvsep = "_" specifies the value separator in the filenames, and encoding = "ISO-8859-1" determines the character encoding of the texts. Notice how the document variables are nicely generated from the file names.

dat_eu <- readtext(
  file = path_eu,
  docvarsfrom = "filenames", 
  docvarnames = c("unit", "context", "year", "language", "party"),
  dvsep = "_", 
  encoding = "ISO-8859-1"
)
glimpse(dat_eu)
Rows: 17
Columns: 7
$ doc_id   <chr> "EU_euro_2004_de_PSE.txt", "EU_euro_2004_de_V.txt", "EU_euro_…
$ text     <chr> "PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse…
$ unit     <chr> "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "…
$ context  <chr> "euro", "euro", "euro", "euro", "euro", "euro", "euro", "euro…
$ year     <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2…
$ language <chr> "de", "de", "en", "en", "es", "es", "fi", "fr", "fr", "gr", "…
$ party    <chr> "PSE", "V", "PSE", "V", "PSE", "V", "V", "PSE", "V", "V", "V"…

3.2.3 JSON

You can also read JSON files (.json) downloaded from the Twitter stream API. twitter.json is located in the data directory of this tutorial package.

The JSON file looks something like this:

{"created_at":"Wed Jun 07 23:30:01 +0000 2017","id":872596537142116352,"id_str":"872596537142116352","text":"@EFC_Jayy UKIP","display_text_range":[10,14],
"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":872596176834572288,
"in_reply_to_status_id_str":"872596176834572288","in_reply_to_user_id":4556760676,"in_reply_to_user_id_str":"4556760676","in_reply_to_screen_name":"EFC_Jayy","user":{"id":863929468984995840,"id_str":"863929468984995840","name":"\u30b8\u30e7\u30fc\u30b8","screen_name":"CoysJoji","location":"Japan","url":null,"description":null,"protected":false,
"verified":false,"followers_count":367,"friends_count":304,"listed_count":1,"favourites_count":1260,"statuses_count":2930,"created_at":"Mon May 15 01:30:11 +0000 2017","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,
"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/870447188400365568\/RiR1hbCe_normal.jpg",
"profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/870447188400365568\/RiR1hbCe_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/863929468984995840\/1494897624","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,
"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"EFC_Jayy","name":"\u274c\u274c\u274c","id":4556760676,"id_str":"4556760676","indices":[0,9]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1496878201171"}

It’s a little hard to parse, but luckily we can leave it to the readtext package to do the job for us.

dat_twitter <- readtext("../data/twitter.json", source = "twitter")

The file comes with several metadata fields for each tweet, such as the number of retweets and likes, the username, time, and time zone.

head(names(dat_twitter))
## [1] "doc_id"         "text"           "retweet_count"  "favorite_count"
## [5] "favorited"      "truncated"

3.2.4 PDF

readtext() can also convert and read PDF (“.pdf”) files. The directory /pdf/UDHR contains PDF versions of the Universal Declaration of Human Rights in several languages, which can be imported as follows:

dat_udhr <- readtext(
  paste0(path_data, "/pdf/UDHR/*.pdf"), 
  docvarsfrom = "filenames", 
  docvarnames = c("document", "language"),
  sep = "_"
)
print(dat_udhr)
readtext object consisting of 11 documents and 2 docvars.
# A data frame: 11 × 4
  doc_id           text                          document language
  <chr>            <chr>                         <chr>    <chr>   
1 UDHR_chinese.pdf "\"世界人权宣言\n\n联合\"..." UDHR     chinese 
2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech   
3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish  
4 UDHR_english.pdf "\"Universal \"..."           UDHR     english 
5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french  
6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek   
# ℹ 5 more rows

3.2.5 Microsoft Word

Finally, readtext() can import Microsoft Word (“.doc” and “.docx”) files.

dat_word <- readtext(paste0(path_data, "/word/*.docx"))
print(dat_word)

3.3 Workflow

quanteda has three basic types of objects:

  1. Corpus

    • Saves character strings and variables in a data frame
    • Combines texts with document-level variables
  2. Tokens

    • Stores tokens as a list of character vectors
    • More efficient than character strings, but preserves the positions of words
    • Positional (string-of-words) analyses are performed on tokens objects
  3. Document-feature matrix (DFM)

    • Represents frequencies of features in documents in a matrix
    • The most efficient structure, but it does not have information on positions of words
    • Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions

Text analysis with quanteda moves through these three types of objects, either explicitly or implicitly.

    graph TD
    D[Text files]
    V[Document-level variables]
    C(Corpus)
    T(Tokens)
    AP["Positional analysis (string-of-words)"]
    AN["Non-positional analysis (bag-of-words)"]
    M(DFM)
    style C stroke-width:4px
    style T stroke-width:4px
    style M stroke-width:4px
    D --> C
    V --> C 
    C --> T 
    T --> M
    T -.-> AP
    M -.-> AN

For example, if character vectors are given to dfm(), it internally constructs corpus and tokens objects before creating a DFM.
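
As a sketch, the explicit three-step workflow looks like this (using the built-in data_char_ukimmig2010 texts introduced in the next subsection):

corp_ex <- corpus(data_char_ukimmig2010)  # character vector -> corpus
toks_ex <- tokens(corp_ex)                # corpus -> tokens
dfmat_ex <- dfm(toks_ex)                  # tokens -> DFM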

3.3.1 Corpus

You can create a corpus from various available sources:

  1. A character vector consisting of one document per element

  2. A data frame consisting of a character vector for documents, and additional vectors for document-level variables

3.3.1.1 Character vector

data_char_ukimmig2010 is a named character vector and consists of sections of British election manifestos on immigration and asylum.

str(data_char_ukimmig2010)
 Named chr [1:9] "IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN SOLVE. \n\n- At current immigration and birth rates,"| __truncated__ ...
 - attr(*, "names")= chr [1:9] "BNP" "Coalition" "Conservative" "Greens" ...
corp_immig <- corpus(
  data_char_ukimmig2010, 
  docvars = data.frame(party = names(data_char_ukimmig2010))
)
print(corp_immig)
Corpus consisting of 9 documents and 1 docvar.
BNP :
"IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN S..."

Coalition :
"IMMIGRATION.  The Government believes that immigration has e..."

Conservative :
"Attract the brightest and best to our country. Immigration h..."

Greens :
"Immigration. Migration is a fact of life.  People have alway..."

Labour :
"Crime and immigration The challenge for Britain We will cont..."

LibDem :
"firm but fair immigration system Britain has always been an ..."

[ reached max_ndoc ... 3 more documents ]
summary(corp_immig)
Corpus consisting of 9 documents, showing 9 documents:

         Text Types Tokens Sentences        party
          BNP  1125   3280        88          BNP
    Coalition   142    260         4    Coalition
 Conservative   251    499        15 Conservative
       Greens   322    679        21       Greens
       Labour   298    683        29       Labour
       LibDem   251    483        14       LibDem
           PC    77    114         5           PC
          SNP    88    134         4          SNP
         UKIP   346    723        26         UKIP

3.3.1.2 Data frame

Using read.csv(), load an example file from path_data as a data frame called dat_inaug. Note that your file does not need to be formatted as .csv; you can build a quanteda corpus from any file format that R can import as a data frame (see, for instance, the rio package for importing various file types as data frames into R).
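
For instance, a sketch using rio (assuming the package is installed):

dat_inaug <- rio::import(paste0(path_data, "/csv/inaugCorpus.csv"))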

# set path
path_data <- system.file("extdata/", package = "readtext")

# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
names(dat_inaug)
[1] "texts"     "Year"      "President" "FirstName"

Construct a corpus from the “texts” column in dat_inaug.

corp_inaug <- corpus(dat_inaug, text_field = "texts")
print(corp_inaug)
Corpus consisting of 5 documents and 3 docvars.
text1 :
"Fellow-Citizens of the Senate and of the House of Representa..."

text2 :
"Fellow citizens, I am again called upon by the voice of my c..."

text3 :
"When it was first perceived, in early times, that no middle ..."

text4 :
"Friends and Fellow Citizens: Called upon to undertake the du..."

text5 :
"Proceeding, fellow citizens, to that qualification which the..."

3.3.1.3 Document-level variables

quanteda’s objects keep information associated with documents. They are called “document-level variables”, or “docvars”, and are accessed using docvars().

corp <- data_corpus_inaugural
head(docvars(corp))
  Year  President FirstName                 Party
1 1789 Washington    George                  none
2 1793 Washington    George                  none
3 1797      Adams      John            Federalist
4 1801  Jefferson    Thomas Democratic-Republican
5 1805  Jefferson    Thomas Democratic-Republican
6 1809    Madison     James Democratic-Republican

If you want to extract individual document variables, you can specify field. Alternatively, you can subset the corpus just as you would a data.frame.

docvars(corp, field = "Year")
 [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
[16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
[31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
[46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017 2021
corp$Year
 [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
[16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
[31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
[46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017 2021

This means that assignment to document-level variables works as usual in R. For example, you could change the Year variable to a factor if you wished. And since docvars() returns a data.frame, you can subset or filter it as you would any data.frame.
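
A minimal sketch of such an assignment, done on a copy so that later examples keep Year numeric:

corp_fct <- corp
corp_fct$Year <- as.factor(corp_fct$Year)  # assign a modified docvar back to the corpus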

docvars(corp) |>
  filter(Year >= 1990)
  Year President FirstName      Party
1 1993   Clinton      Bill Democratic
2 1997   Clinton      Bill Democratic
3 2001      Bush George W. Republican
4 2005      Bush George W. Republican
5 2009     Obama    Barack Democratic
6 2013     Obama    Barack Democratic
7 2017     Trump Donald J. Republican
8 2021     Biden Joseph R. Democratic
# {quanteda} also provides corpus_subset() function, but since we learnt about
# dplyr, we can use it here.

Another useful feature is the ability to change the unit of texts. For example, the UK Immigration 2010 data set is a corpus of 9 documents, where each document is a party's manifesto section on immigration.

corp <- corpus(data_char_ukimmig2010)
print(corp)
Corpus consisting of 9 documents.
BNP :
"IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN S..."

Coalition :
"IMMIGRATION.  The Government believes that immigration has e..."

Conservative :
"Attract the brightest and best to our country. Immigration h..."

Greens :
"Immigration. Migration is a fact of life.  People have alway..."

Labour :
"Crime and immigration The challenge for Britain We will cont..."

LibDem :
"firm but fair immigration system Britain has always been an ..."

[ reached max_ndoc ... 3 more documents ]

We can use corpus_reshape() to change the unit of texts. For example, we can change the unit of texts to sentences using the command below.

corp_sent <- corpus_reshape(corp, to = "sentences")
print(corp_sent)
Corpus consisting of 206 documents.
BNP.1 :
"IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN S..."

BNP.2 :
"The Scale of the Crisis  Britain's existence is in grave per..."

BNP.3 :
"In the absence of urgent action, we, the indigenous British ..."

BNP.4 :
"We, alone of all the political parties, have a decades-long ..."

BNP.5 :
"British People Set to be a Minority within 30 - 50 Years: Th..."

BNP.6 :
"Figures released by the ONS in January 2009 revealed that th..."

[ reached max_ndoc ... 200 more documents ]

The following code restores the corpus to the document level.

corp_doc <- corpus_reshape(corp_sent, to = "documents")
print(corp_doc)
Corpus consisting of 9 documents.
BNP :
"IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN S..."

Coalition :
"IMMIGRATION.  The Government believes that immigration has e..."

Conservative :
"Attract the brightest and best to our country.  Immigration ..."

Greens :
"Immigration.  Migration is a fact of life.  People have alwa..."

Labour :
"Crime and immigration  The challenge for Britain  We will co..."

LibDem :
"firm but fair immigration system  Britain has always been an..."

[ reached max_ndoc ... 3 more documents ]

3.3.2 Tokens

tokens() segments texts in a corpus into tokens (words or sentences) by word boundaries. By default, tokens() only removes separators (typically white spaces), but you can also remove punctuation and numbers.

toks <- tokens(corp_immig)
print(toks)
Tokens consisting of 9 documents and 1 docvar.
BNP :
 [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED" "CRISIS"      
 [6] "WHICH"        "ONLY"         "THE"          "BNP"          "CAN"         
[11] "SOLVE"        "."           
[ ... and 3,268 more ]

Coalition :
 [1] "IMMIGRATION" "."           "The"         "Government"  "believes"   
 [6] "that"        "immigration" "has"         "enriched"    "our"        
[11] "culture"     "and"        
[ ... and 248 more ]

Conservative :
 [1] "Attract"     "the"         "brightest"   "and"         "best"       
 [6] "to"          "our"         "country"     "."           "Immigration"
[11] "has"         "enriched"   
[ ... and 487 more ]

Greens :
 [1] "Immigration" "."           "Migration"   "is"          "a"          
 [6] "fact"        "of"          "life"        "."           "People"     
[11] "have"        "always"     
[ ... and 667 more ]

Labour :
 [1] "Crime"       "and"         "immigration" "The"         "challenge"  
 [6] "for"         "Britain"     "We"          "will"        "control"    
[11] "immigration" "with"       
[ ... and 671 more ]

LibDem :
 [1] "firm"        "but"         "fair"        "immigration" "system"     
 [6] "Britain"     "has"         "always"      "been"        "an"         
[11] "open"        ","          
[ ... and 471 more ]

[ reached max_ndoc ... 3 more documents ]
toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
print(toks_nopunct)
Tokens consisting of 9 documents.
BNP :
 [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"       "WHICH"       
 [6] "ONLY"         "THE"          "BNP"          "CAN"          "SOLVE"       
[11] "At"           "current"     
[ ... and 2,839 more ]

Coalition :
 [1] "IMMIGRATION"  "The"          "Government"   "believes"     "that"        
 [6] "immigration"  "has"          "enriched"     "our"          "culture"     
[11] "and"          "strengthened"
[ ... and 219 more ]

Conservative :
 [1] "Attract"     "the"         "brightest"   "and"         "best"       
 [6] "to"          "our"         "country"     "Immigration" "has"        
[11] "enriched"    "our"        
[ ... and 440 more ]

Greens :
 [1] "Immigration" "Migration"   "is"          "a"           "fact"       
 [6] "of"          "life"        "People"      "have"        "always"     
[11] "moved"       "from"       
[ ... and 598 more ]

Labour :
 [1] "Crime"       "and"         "immigration" "The"         "challenge"  
 [6] "for"         "Britain"     "We"          "will"        "control"    
[11] "immigration" "with"       
[ ... and 608 more ]

LibDem :
 [1] "firm"        "but"         "fair"        "immigration" "system"     
 [6] "Britain"     "has"         "always"      "been"        "an"         
[11] "open"        "welcoming"  
[ ... and 423 more ]

[ reached max_ndoc ... 3 more documents ]

You can see how keywords are used in the actual contexts in a concordance view produced by kwic().

kw <- kwic(toks, pattern =  "immig*")
head(kw, 10)
Keyword-in-context with 10 matches.                                                                 
   [BNP, 1]                                       | IMMIGRATION |
  [BNP, 16]                   SOLVE. - At current | immigration |
  [BNP, 78]                 a halt to all further | immigration |
  [BNP, 85]        the deportation of all illegal | immigrants  |
 [BNP, 169]          Britain, regardless of their | immigration |
 [BNP, 197] admission that they orchestrated mass | immigration |
 [BNP, 272]            grave peril, threatened by | immigration |
 [BNP, 374]                  ), legal Third World | immigrants  |
 [BNP, 531]        to second and third generation |  immigrant  |
 [BNP, 661]                     are added in, the |  immigrant  |
                                          
 : AN UNPARALLELED CRISIS WHICH           
 and birth rates, indigenous              
 , the deportation of all                 
 , a halt to the                          
 status. - The BNP                        
 to change forcibly Britain's demographics
 and multiculturalism. In the             
 made up 14.7 percent (                   
 mothers. Figures released by             
 birth rate is estimated to               
Note
  1. If you want to find multi-word expressions, separate words by white space and wrap the character vector by phrase(), as follows:

    kw_asylum <- kwic(toks, pattern = phrase("asylum seeker*"))
  2. Texts do not always appear nicely in your R console, so you can use View() to see the keywords-in-context in an interactive HTML table.

You can remove tokens that you are not interested in using tokens_select(). In pre-processing, we usually remove function words (grammatical words) that carry little or no substantive meaning. stopwords() returns a pre-defined list of function words.

toks_nostop <- tokens_select(
  toks, 
  pattern = stopwords("en"), 
  selection = "remove"  # keep or remove
)
print(toks_nostop)
Tokens consisting of 9 documents and 1 docvar.
BNP :
 [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"       "BNP"         
 [6] "CAN"          "SOLVE"        "."            "-"            "current"     
[11] "immigration"  "birth"       
[ ... and 2,109 more ]

Coalition :
 [1] "IMMIGRATION"  "."            "Government"   "believes"     "immigration" 
 [6] "enriched"     "culture"      "strengthened" "economy"      ","           
[11] "must"         "controlled"  
[ ... and 146 more ]

Conservative :
 [1] "Attract"     "brightest"   "best"        "country"     "."          
 [6] "Immigration" "enriched"    "nation"      "years"       "want"       
[11] "attract"     "brightest"  
[ ... and 277 more ]

Greens :
 [1] "Immigration" "."           "Migration"   "fact"        "life"       
 [6] "."           "People"      "always"      "moved"       "one"        
[11] "country"     "another"    
[ ... and 377 more ]

Labour :
 [1] "Crime"            "immigration"      "challenge"        "Britain"         
 [5] "control"          "immigration"      "new"              "Australian-style"
 [9] "points-based"     "system"           "-"                "unlike"          
[ ... and 391 more ]

LibDem :
 [1] "firm"        "fair"        "immigration" "system"      "Britain"    
 [6] "always"      "open"        ","           "welcoming"   "country"    
[11] ","           "thousands"  
[ ... and 285 more ]

[ reached max_ndoc ... 3 more documents ]
Note

The stopwords() function returns character vectors of stopwords for different languages, using the ISO-639-1 language codes. For Malay, use stopwords("ms", source = "stopwords-iso"). For a Brunei-specific context, you may need to amend the stopword list yourself.
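
A sketch of amending the stopword list (the added words are hypothetical placeholders):

stopwords_custom <- c(stopwords("ms", source = "stopwords-iso"), "lah", "kan")
tail(stopwords_custom)  # the custom additions appear at the end of the vector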

You can generate n-grams of any length from a tokens object using tokens_ngrams(). N-grams are contiguous sequences of n tokens from already tokenized text objects. For example, in the phrase “natural language processing”:

  • Unigram (1-gram): “natural”, “language”, “processing”
  • Bigram (2-gram): “natural language”, “language processing”
  • Trigram (3-gram): “natural language processing”
# tokens_ngrams() also supports skip to generate skip-grams.
toks_ngram <- tokens_ngrams(toks_nopunct, n = 3, skip = 0)
head(toks_ngram[[1]], 20)  # the first political party's trigrams
 [1] "IMMIGRATION_AN_UNPARALLELED" "AN_UNPARALLELED_CRISIS"     
 [3] "UNPARALLELED_CRISIS_WHICH"   "CRISIS_WHICH_ONLY"          
 [5] "WHICH_ONLY_THE"              "ONLY_THE_BNP"               
 [7] "THE_BNP_CAN"                 "BNP_CAN_SOLVE"              
 [9] "CAN_SOLVE_At"                "SOLVE_At_current"           
[11] "At_current_immigration"      "current_immigration_and"    
[13] "immigration_and_birth"       "and_birth_rates"            
[15] "birth_rates_indigenous"      "rates_indigenous_British"   
[17] "indigenous_British_people"   "British_people_are"         
[19] "people_are_set"              "are_set_to"                 

3.3.3 Document-feature matrix

dfm() constructs a document-feature matrix (DFM) from a tokens object.

toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfmat_inaug <- dfm(toks_inaug)
print(dfmat_inaug)
Document-feature matrix of: 59 documents, 9,419 features (91.89% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives
  1789-Washington               1  71 116      1  48     2               2
  1793-Washington               0  11  13      0   2     0               0
  1797-Adams                    3 140 163      1 130     0               2
  1801-Jefferson                2 104 130      0  81     0               0
  1805-Jefferson                0 101 143      0  93     0               0
  1809-Madison                  1  69 104      0  43     0               0
                 features
docs              among vicissitudes incident
  1789-Washington     1            1        1
  1793-Washington     0            0        0
  1797-Adams          4            0        0
  1801-Jefferson      1            0        0
  1805-Jefferson      7            0        0
  1809-Madison        0            0        0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,409 more features ]

Some useful functions to operate on DFMs are:

  1. ndoc(): returns the number of documents
  2. nfeat(): returns the number of features
  3. docnames(): returns the document names
  4. featnames(): returns the feature (column) names
  5. topfeatures(): returns the most frequent features
  6. docvars(): returns the document-level variables

DFMs sometimes behave like ordinary matrices too, so you can use rowSums() and colSums() to calculate marginals.
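
For example (the document and feature counts match the DFM header printed above):

ndoc(dfmat_inaug)            # 59 documents
nfeat(dfmat_inaug)           # 9,419 features
topfeatures(dfmat_inaug, 5)  # most frequent features overall
head(rowSums(dfmat_inaug))   # total token count per document
head(colSums(dfmat_inaug))   # total count of each feature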

Perhaps most commonly, you will want to select columns (i.e. features) of the DFM that match a pattern. For instance:

dfm_select(dfmat_inaug, pattern = "freedom")
Document-feature matrix of: 59 documents, 1 feature (38.98% sparse) and 4 docvars.
                 features
docs              freedom
  1789-Washington       0
  1793-Washington       0
  1797-Adams            0
  1801-Jefferson        4
  1805-Jefferson        2
  1809-Madison          1
[ reached max_ndoc ... 53 more documents ]
dfm_keep(dfmat_inaug, min_nchar = 5)
Document-feature matrix of: 59 documents, 8,560 features (93.04% sparse) and 4 docvars.
                 features
docs              fellow-citizens senate house representatives among
  1789-Washington               1      1     2               2     1
  1793-Washington               0      0     0               0     0
  1797-Adams                    3      1     0               2     4
  1801-Jefferson                2      0     0               0     1
  1805-Jefferson                0      0     0               0     7
  1809-Madison                  1      0     0               0     0
                 features
docs              vicissitudes incident event could filled
  1789-Washington            1        1     2     3      1
  1793-Washington            0        0     0     0      0
  1797-Adams                 0        0     0     1      0
  1801-Jefferson             0        0     0     0      0
  1805-Jefferson             0        0     0     2      0
  1809-Madison               0        0     0     1      1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 8,550 more features ]

There is also dfm_trim() to remove features that are too frequent or too rare, based on term or document frequencies.

# Trim DFM containing features that occur less than 10 times in the corpus
dfm_trim(dfmat_inaug, min_termfreq = 10)
Document-feature matrix of: 59 documents, 1,524 features (68.93% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives
  1789-Washington               1  71 116      1  48     2               2
  1793-Washington               0  11  13      0   2     0               0
  1797-Adams                    3 140 163      1 130     0               2
  1801-Jefferson                2 104 130      0  81     0               0
  1805-Jefferson                0 101 143      0  93     0               0
  1809-Madison                  1  69 104      0  43     0               0
                 features
docs              among to life
  1789-Washington     1 48    1
  1793-Washington     0  5    0
  1797-Adams          4 72    2
  1801-Jefferson      1 61    1
  1805-Jefferson      7 83    2
  1809-Madison        0 61    1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 1,514 more features ]
# Trim DFM containing features that occur in more than 10% of the documents
dfm_trim(dfmat_inaug, max_docfreq = 0.1, docfreq_type = "prop")
Document-feature matrix of: 59 documents, 7,388 features (96.87% sparse) and 4 docvars.
                 features
docs              vicissitudes filled anxieties notification transmitted 14th
  1789-Washington            1      1         1            1           1    1
  1793-Washington            0      0         0            0           0    0
  1797-Adams                 0      0         0            0           0    0
  1801-Jefferson             0      0         0            0           0    0
  1805-Jefferson             0      0         0            0           0    0
  1809-Madison               0      1         0            0           0    0
                 features
docs              month summoned veneration fondest
  1789-Washington     1        1          1       1
  1793-Washington     0        0          0       0
  1797-Adams          0        0          2       0
  1801-Jefferson      0        0          0       0
  1805-Jefferson      0        0          0       0
  1809-Madison        0        0          0       0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 7,378 more features ]

3.4 Statistical analysis

Note: If you have not installed quanteda.corpora, do so by running

remotes::install_github("quanteda/quanteda.corpora")

3.4.1 Simple frequency analysis

Unlike topfeatures(), textstat_frequency() shows both term and document frequencies. You can also use the function to find the most frequent features within groups. Using the download() function from quanteda.corpora, you can retrieve a text corpus of tweets.

corp_tweets <- quanteda.corpora::download(url = "https://www.dropbox.com/s/846skn1i5elbnd2/data_corpus_sampletweets.rds?dl=1")

We can analyse the most frequent hashtags by applying tokens_keep(pattern = "#*") before creating a DFM.

toks_tweets <- 
  tokens(corp_tweets, remove_punct = TRUE) |>
  tokens_keep(pattern = "#*")
dfmat_tweets <- dfm(toks_tweets)

tstat_freq <- textstat_frequency(dfmat_tweets, n = 5, groups = lang)
head(tstat_freq, 20)
             feature frequency rank docfreq     group
1           #twitter         1    1       1    Basque
2     #canviemeuropa         1    1       1    Basque
3             #prest         1    1       1    Basque
4           #psifizo         1    1       1    Basque
5     #ekloges2014gr         1    1       1    Basque
6            #ep2014         1    1       1 Bulgarian
7         #yourvoice         1    1       1 Bulgarian
8      #eudebate2014         1    1       1 Bulgarian
9            #велико         1    1       1 Bulgarian
10           #ep2014         1    1       1  Croatian
11 #savedonbaspeople         1    1       1  Croatian
12   #vitoriagasteiz         1    1       1  Croatian
13           #ep14dk        32    1      32    Danish
14            #dkpol        18    2      18    Danish
15            #eupol         7    3       7    Danish
16        #vindtilep         7    3       7    Danish
17    #patentdomstol         4    5       4    Danish
18           #ep2014        35    1      35     Dutch
19              #vvd        11    2      11     Dutch
20               #eu         9    3       7     Dutch

You can also plot the Twitter hashtag frequencies easily using ggplot().

dfmat_tweets |>
  textstat_frequency(n = 15) |>
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_bw()

Alternatively, you can create a word cloud of the 100 most common hashtags.

set.seed(132)
textplot_wordcloud(dfmat_tweets, max_words = 100)

Finally, it is possible to compare different groups within one word cloud. We must first create a dummy variable indicating whether a tweet was posted in English or another language. Afterwards, we can compare the most frequent hashtags of English and non-English tweets.

# create document-level variable indicating whether tweet was in English or
# other language
corp_tweets$dummy_english <- 
  factor(ifelse(corp_tweets$lang == "English", "English", "Not English"))

# tokenize texts
toks_tweets <- tokens(corp_tweets)

# create a grouped dfm and compare groups
dfmat_corp_language <- 
  dfm(toks_tweets) |>
  dfm_keep(pattern = "#*") |>
  dfm_group(groups = dummy_english)

# create wordcloud
set.seed(132) # set seed for reproducibility
textplot_wordcloud(dfmat_corp_language, comparison = TRUE, max_words = 200)

3.4.2 Lexical diversity

Lexical diversity is a measure of how varied the vocabulary in a text or speech is. It indicates the richness of language use by comparing the number of unique words (types) to the total number of words (tokens) in the text. It is useful, for instance, for analysing speakers’ or writers’ linguistic skills, or the complexity of ideas expressed in documents.

A common metric for lexical diversity is the Type-Token Ratio (TTR), calculated as: \[ TTR = \frac{N_{\text{types}}}{N_{\text{tokens}}} \]
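
To illustrate the formula directly: quanteda's ntype() and ntoken() return the number of unique types and the total number of tokens per document, so the ratio below is the TTR of the first inaugural address (the values reported further down differ because stopwords are removed there first):

ntype(toks_inaug)[1] / ntoken(toks_inaug)[1]  # TTR by hand for the first speech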

toks_inaug <- tokens(data_corpus_inaugural)
dfmat_inaug <- 
  dfm(toks_inaug) |>
  dfm_remove(pattern = stopwords("en"))  # similar to dfm_select()

tstat_lexdiv <- textstat_lexdiv(dfmat_inaug)
tail(tstat_lexdiv, 5)
     document       TTR
55  2005-Bush 0.6176753
56 2009-Obama 0.6828645
57 2013-Obama 0.6605238
58 2017-Trump 0.6409537
59 2021-Biden 0.5572316

We can prepare a plot using ggplot() as follows:

plot_df <-
  tstat_lexdiv |>
  mutate(id = row_number())

ggplot(plot_df, aes(id, TTR)) +
  geom_line() +
  scale_x_continuous(
    breaks = plot_df$id,
    labels = plot_df$document,
    name = NULL
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.4.3 Document/Feature similarity

Document/feature similarity is a measure of how alike two documents or sets of features are based on their content. It quantifies the degree to which documents share similar terms, topics, or characteristics.

textstat_dist() calculates distances between documents or features, for a variety of measures. The output is compatible with R’s dist(), so hierarchical clustering can be performed without any transformation.

toks_inaug <- tokens(data_corpus_inaugural)
dfmat_inaug <- 
  dfm(toks_inaug) |>
  dfm_remove(pattern = stopwords("en"))  # similar to dfm_select()

# Calculate document similarity
dist_mat <- textstat_dist(dfmat_inaug)  # using Euclidean distance
dist_mat[1:3, 1:3]
3 x 3 Matrix of class "dspMatrix"
                1789-Washington 1793-Washington 1797-Adams
1789-Washington         0.00000        76.13803   141.4072
1793-Washington        76.13803         0.00000   206.6954
1797-Adams            141.40721       206.69543     0.0000

To plot this using ggplot(), we rely on the ggdendro package.

clust <- hclust(as.dist(dist_mat))  # hierarchical clustering

library(ggdendro)
dendr <- dendro_data(clust)
ggdendrogram(dendr, rotate = TRUE) 

3.4.4 Feature co-occurrence matrix

A feature co-occurrence matrix (FCM) is a square matrix that counts the number of times two features co-occur in the same context, such as within the same document, sentence, or window of text. This is a special object in quanteda, but behaves similarly to a DFM. As an example, consider the following:

tibble(
  doc_id = 1:2,
  text = c("I love Mathematics.", "Mathematics is awesome.")
) |>
  corpus() |>
  tokens(remove_punct = TRUE) |>
  fcm(context = "document")  # by default
Feature co-occurrence matrix of: 5 by 5 features.
             features
features      I love Mathematics is awesome
  I           0    1           1  0       0
  love        0    0           1  0       0
  Mathematics 0    0           0  1       1
  is          0    0           0  0       1
  awesome     0    0           0  0       0

Let’s download the data_corpus_guardian corpus from the quanteda.corpora package.

corp_news <- quanteda.corpora::download("data_corpus_guardian")

When a corpus is large, you have to select the features of the DFM before constructing an FCM. In the example below, we clean up as follows:

  1. Remove all stopwords and punctuation characters.
  2. Remove certain patterns that usually describe the publication time and date of articles.
  3. Keep only terms that occur at least 100 times in the document-feature matrix.
toks_news <- tokens(corp_news, remove_punct = TRUE)
dfmat_news <- 
  dfm(toks_news) |>
  dfm_remove(pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst", "|")) |>
  dfm_trim(min_termfreq = 100)

topfeatures(dfmat_news)
      said     people        one        new       also         us        can 
     28413      11169       9884       8024       7901       7091       6972 
government       year       last 
      6821       6570       6335 
nfeat(dfmat_news)
[1] 4211

To construct an FCM from a DFM (or a tokens object), use fcm(). You can visualise the FCM using a textplot_network() graph as follows:

fcmat_news <- fcm(dfmat_news, context = "document")
feat <- names(topfeatures(dfmat_news, 30))  # Top 30 features
fcmat_news_select <- fcm_select(fcmat_news, pattern = feat, selection = "keep")
dim(fcmat_news_select)
[1] 30 30
set.seed(123)
quanteda.textplots::textplot_network(fcmat_news_select)

3.5 Scaling and classification

In this section we apply mainly unsupervised learning models to textual data. Scaling and classification aim to uncover hidden structures, relationships, and patterns within textual data by placing texts or words on latent scales (scaling) and grouping them into meaningful categories or themes (classification). This process transforms complex, high-dimensional text into more interpretable and actionable insights.

3.5.1 Wordfish

Wordfish is a Poisson scaling model of one-dimensional document positions (Slapin and Proksch 2008). This model is used primarily for scaling political texts to position documents (like speeches or manifestos) on a latent dimension, often reflecting ideological or policy positions. The main objective is to identify the relative positioning of texts on a scale (e.g., left-right political spectrum) based on word frequencies.

Let \(y_{ij}\) be the count of word \(j\) in document \(i\). Then assume

\[\begin{align} y_{ij} &\sim \operatorname{Poi}(\lambda_{ij}) \\ \log (\lambda_{ij}) &= \psi_j +\beta_j \theta_i \end{align}\]

In this example, we will show how to apply Wordfish to the Irish budget speeches from 2010. First, we will create a document-feature matrix. Afterwards, we will run Wordfish.

toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)

# Run Wordfish model
tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(tmod_wf)

Call:
textmodel_wordfish.dfm(x = dfmat_irish, dir = c(6, 5))

Estimated Document Positions:
                             theta      se
Lenihan, Brian (FF)        1.79403 0.02007
Bruton, Richard (FG)      -0.62160 0.02823
Burton, Joan (LAB)        -1.13503 0.01568
Morgan, Arthur (SF)       -0.07841 0.02896
Cowen, Brian (FF)          1.77846 0.02330
Kenny, Enda (FG)          -0.75350 0.02635
ODonnell, Kieran (FG)     -0.47615 0.04309
Gilmore, Eamon (LAB)      -0.58406 0.02994
Higgins, Michael (LAB)    -1.00383 0.03964
Quinn, Ruairi (LAB)       -0.92648 0.04183
Gormley, John (Green)      1.18361 0.07224
Ryan, Eamon (Green)        0.14738 0.06321
Cuffe, Ciaran (Green)      0.71541 0.07291
OCaolain, Caoimhghin (SF) -0.03982 0.03878

Estimated Feature Scores:
        when      i presented    the supplementary  budget     to   this  house
beta -0.1594 0.3177    0.3603 0.1933         1.077 0.03537 0.3077 0.2473 0.1399
psi   1.6241 2.7239   -1.7958 5.3308        -1.134 2.70993 4.5190 3.4603 1.0396
       last   april    said     we   could   work    our   way through  period
beta 0.2419 -0.1565 -0.8339 0.4156 -0.6138 0.5221 0.6892 0.275  0.6115  0.4985
psi  0.9853 -0.5725 -0.4514 3.5125  1.0858 1.1151 2.5278 1.419  1.1603 -0.1779
         of severe economic distress   today   can  report    that
beta 0.2777  1.229   0.4237    1.799 0.09141 0.304  0.6196 0.01506
psi  4.4656 -2.013   1.5714   -4.456 0.83875 1.564 -0.2466 3.83785
     notwithstanding difficulties   past
beta           1.799        1.175 0.4746
psi           -4.456       -1.357 0.9321

The R output shows the results of the Wordfish model applied to Irish political texts, estimating the ideological positions of various politicians. Each politician is assigned a “theta” value, representing their placement on a latent scale; positive values indicate one end of the spectrum, while negative values indicate the opposite.

For example, Brian Lenihan (FF) has a high positive theta, suggesting a strong position on one side, while Joan Burton (LAB) has a negative theta, placing her on the other side. The model also provides feature scores for words (beta values), indicating their importance in distinguishing between these positions. Words with higher absolute beta values, such as “supplementary,” are key in differentiating the ideological content of the texts, while psi values reflect word frequency variance, contributing to the model’s differentiation of document positions.

We can plot the results of a fitted scaling model using textplot_scale1d().
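
For the estimated document positions:

textplot_scale1d(tmod_wf)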

Note

The value of 0 for theta in the Wordfish model is not a true zero in an absolute sense. Instead, it serves as a relative reference point on the latent scale. In Wordfish, theta values are relative, meaning they indicate positions along a spectrum where the direction (positive or negative) is determined by the model’s scaling based on the data and specified parameters.

The function also allows you to plot scores by a grouping variable, in this case the party affiliation of the speakers.

textplot_scale1d(tmod_wf, groups = dfmat_irish$party)

Finally, we can plot the estimated word positions and highlight certain features.

textplot_scale1d(
  tmod_wf, 
  margin = "features", 
  highlighted = c("government", "global", "children", 
                  "bank", "economy", "the", "citizenship",
                  "productivity", "deficit")
)

Beta (x-axis): reflects how strongly a word is associated with the latent dimension (e.g., ideological position). Words with high absolute beta values are more influential in distinguishing between positions; positive beta values indicate words associated with one end of the scale, negative values with the other.

Psi (y-axis): represents the variance in word frequency. Higher psi values indicate that a word occurs with varying frequency across documents, while lower values indicate more consistent usage.

Therefore, words in the upper right (high beta, high psi) are influential and variably used, indicating key terms that may strongly differentiate between document positions. Words in the lower left (low beta, low psi) are less influential and consistently used, likely serving as common or neutral terms.

The plot also helps identify which words are driving the distinctions in the latent scale and how their usage varies across documents.

3.5.2 Topic models

Topic models are statistical models used to identify the underlying themes or topics within a large collection of documents. They analyze word co-occurrences across documents to group words into topics, where each topic is a distribution over words, and each document is a mixture of these topics.

A common topic model is Latent Dirichlet Allocation (LDA), which assumes that each document contains multiple topics in varying proportions. Topic models help uncover hidden semantic structures in text, making them useful for organizing, summarizing, and exploring large text datasets. In R, we use the seededlda package for LDA.

# install.packages("seededlda")
library(seededlda)

Back to the Guardian data, corp_news. We will select only news articles published in 2016, using the corpus_subset() function and the year() function from the lubridate package.

corp_news_2016 <- corpus_subset(corp_news, year(date) == 2016)
ndoc(corp_news_2016)
[1] 1959

Further, after removing function words and punctuation during tokenization, we will keep only the top 20% most frequent features (min_termfreq = 0.8) that appear in less than 10% of all documents (max_docfreq = 0.1), using dfm_trim(), to focus on common but distinguishing features.

# Create tokens
toks_news <- 
  tokens(
    corp_news_2016, 
    remove_punct = TRUE, 
    remove_numbers = TRUE, 
    remove_symbol = TRUE
  ) |>
  tokens_remove(
    pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst")
  )

# Create DFM
dfmat_news <- 
  dfm(toks_news) |>
  dfm_trim(
    min_termfreq = 0.8, 
    termfreq_type = "quantile",
    max_docfreq = 0.1, 
    docfreq_type = "prop"
  )

The LDA is fitted using the code below. Note that k = 10 specifies the number of topics to be discovered. This is an important parameter and you should try a variety of values and validate the outputs of your topic models thoroughly.

# Takes a while to fit!
tmod_lda <- seededlda::textmodel_lda(dfmat_news, k = 10)

You can extract the most important terms for each topic from the model using terms(). Each column (topic1, topic2, etc.) lists words that frequently co-occur in the dataset, suggesting a common theme within each topic.

terms(tmod_lda, 10)
      topic1     topic2       topic3         topic4       topic5    topic6    
 [1,] "syria"    "labor"      "johnson"      "funding"    "son"     "officers"
 [2,] "refugees" "australia"  "brussels"     "housing"    "church"  "violence"
 [3,] "isis"     "australian" "talks"        "nhs"        "black"   "doctors" 
 [4,] "military" "corbyn"     "boris"        "income"     "love"    "prison"  
 [5,] "syrian"   "turnbull"   "benefits"     "education"  "father"  "victims" 
 [6,] "un"       "budget"     "summit"       "scheme"     "felt"    "sexual"  
 [7,] "islamic"  "leadership" "negotiations" "fund"       "parents" "abuse"   
 [8,] "forces"   "shadow"     "ireland"      "green"      "story"   "hospital"
 [9,] "turkey"   "senate"     "migrants"     "homes"      "visit"   "criminal"
[10,] "muslim"   "coalition"  "greece"       "businesses" "read"    "crime"   
      topic7      topic8    topic9          topic10     
 [1,] "oil"       "clinton" "climate"       "sales"     
 [2,] "markets"   "sanders" "water"         "apple"     
 [3,] "prices"    "cruz"    "energy"        "customers" 
 [4,] "banks"     "hillary" "food"          "users"     
 [5,] "investors" "obama"   "gas"           "google"    
 [6,] "shares"    "trump's" "drug"          "technology"
 [7,] "trading"   "bernie"  "drugs"         "games"     
 [8,] "china"     "ted"     "environmental" "game"      
 [9,] "rates"     "rubio"   "air"           "iphone"    
[10,] "quarter"   "senator" "emissions"     "app"       

As an example, Topic 1 (“syria”, “refugees”, “isis”), is likely related to international conflicts, specifically around Syria and refugee crises. Topic 4 (“funding”, “housing”, “nhs”) is likely related to public services and social welfare issues, such as healthcare and housing. Each topic provides a distinct theme, derived from the words that frequently appear together in the corpus, helping to summarize and understand the main themes in the text.

You can then obtain the most likely topics using topics() and save them as a document-level variable.

# assign topic as a new document-level variable
dfmat_news$topic <- topics(tmod_lda)

# cross-table of the topic frequency
table(dfmat_news$topic)

 topic1  topic2  topic3  topic4  topic5  topic6  topic7  topic8  topic9 topic10 
    203     222      85     211     249     236     180     186     196     184 

In the seeded LDA, you can pre-define topics in LDA using a dictionary of “seeded” words. For more information, see the seededlda package documentation.
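
A minimal sketch of a seeded model (the seed dictionary below is a hypothetical example; see the package documentation for the full interface):

# hypothetical seed words anchoring two topics
dict_seed <- dictionary(list(
  economy = c("market*", "bank*", "price*"),
  politics = c("clinton", "sanders", "election*")
))
tmod_slda <- textmodel_seededlda(dfmat_news, dictionary = dict_seed)
terms(tmod_slda, 10)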

3.5.3 Latent semantic scaling

Latent Semantic Scaling (LSS) is a method used to place words or documents on a latent scale that represents an underlying dimension, such as sentiment, ideology, or any other semantic axis. The key idea is to use the co-occurrence patterns of words across documents to identify and position items along this hidden dimension.

LSS is performed using the LSX package. In this example, we will apply LSS to the corpus of Sputnik articles about Ukraine. First, we prepare the data set.

# Read the RDS file directly from the URL
corp <- readRDS(url("https://www.dropbox.com/s/abme18nlrwxgmz8/data_corpus_sputnik2022.rds?dl=1"))

toks <-
  corp |>
  corpus_reshape("sentences") |>  # this is a must!
  tokens(
    remove_punct = TRUE, 
    remove_symbols = TRUE, 
    remove_numbers = TRUE, 
    remove_url = TRUE
  )

dfmt <-
  dfm(toks) |>
  dfm_remove(pattern = stopwords("en"))

Now, to fit the LSS model, run the following command:

lss <- textmodel_lss(
  dfmt, 
  seeds = as.seedwords(data_dictionary_sentiment), 
  k = 300, 
  # cache = TRUE, 
  include_data = TRUE, 
  group_data = TRUE
)

Taking the DFM and the seed words as the only inputs, textmodel_lss() computes the polarity scores of all the words in the corpus based on their semantic similarity to the seed words. You usually do not need to change the value of k (300 by default).

Let’s look at the output of the LSS model:

summary(lss)

Call:
textmodel_lss(x = dfmt, seeds = as.seedwords(data_dictionary_sentiment), 
    k = 300, include_data = TRUE, group_data = TRUE)

Seeds:
       good        nice   excellent    positive   fortunate     correct 
          1           1           1           1           1           1 
   superior         bad       nasty        poor    negative unfortunate 
          1          -1          -1          -1          -1          -1 
      wrong    inferior 
         -1          -1 

Beta:
(showing first 30 elements)
          excellent            positive              gander                good 
             0.2102              0.1918              0.1883              0.1829 
         diplomatic           staffer's              ex-kgb             correct 
             0.1802              0.1773              0.1771              0.1749 
               mend  inter-parliamntary         workmanship      russo-american 
             0.1743              0.1719              0.1715              0.1713 
             nicety           china-u.s               soufi good-neighborliness 
             0.1712              0.1674              0.1664              0.1644 
       china-canada             fidgety           relations           downtrend 
             0.1616              0.1611              0.1584              0.1578 
            cordial        canada-china            superior       understanding 
             0.1573              0.1569              0.1569              0.1561 
   good-neighbourly           brennan's              mutual          reaffirmed 
             0.1550              0.1544              0.1538              0.1534 
              blida          abdelkader 
             0.1533              0.1533 

Data Dimension:
[1]  8063 59711

Polarity scores in Latent Semantic Scaling (LSS) quantify how words or documents relate to a specific dimension (e.g., sentiment) based on predefined seed words. Seed words represent the extremes of this dimension (e.g., “good” vs. “bad”). LSS analyzes how other words co-occur with these seed words to assign a score.

We can visualize the polarity of words using textplot_terms(). If you pass a dictionary to be highlighted, words are plotted in different colors. data_dictionary_LSD2015 is a widely-used sentiment dictionary. If highlighted = NULL, words are selected randomly to highlight.

textplot_terms(lss, highlighted = data_dictionary_LSD2015[1:2])

Based on the fitted model, we can now predict polarity scores of documents using predict(). It is best to work with the document-level data frame, to which we then add a new column containing the predicted polarity scores.

dat <- docvars(lss$data)
dat$lss <- predict(lss)
glimpse(dat)
Rows: 8,063
Columns: 5
$ head <chr> "Biden: US Desires Diplomacy But 'Ready No Matter What Happens' I…
$ url  <chr> "https://sputniknews.com/20220131/biden-us-desires-diplomacy-but-…
$ time <dttm> 2022-02-01 03:25:22, 2022-02-01 01:58:19, 2022-02-01 01:47:56, 2…
$ date <date> 2022-01-31, 2022-01-31, 2022-01-31, 2022-01-31, 2022-01-31, 2022…
$ lss  <dbl> 0.10093989, 0.99263241, -0.76609160, 0.15991318, 0.25420329, -1.5…

Basically, what we have is a data frame where each row represents a single document (here, a news item from Sputnik with a timestamp) along with its predicted polarity score from the LSS model. We can visualise this easily using ggplot(). But first, we need to smooth the scores using smooth_lss(); otherwise they are too noisy to interpret.

dat_smooth <- smooth_lss(dat, lss_var = "lss", date_var = "date")

ggplot(dat_smooth, aes(x = date, y = fit)) + 
  geom_line() +
  geom_ribbon(
    aes(ymin = fit - se.fit * 1.96, 
        ymax = fit + se.fit * 1.96), 
    alpha = 0.1
  ) +
  geom_vline(xintercept = as.Date("2022-02-24"), linetype = "dotted") +
  scale_x_date(date_breaks = "months", date_labels = "%b %Y", name = NULL) +
  labs(title = "Sentiment about Ukraine", x = "Date", y = "Sentiment") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows that the sentiment of the articles about Ukraine became more negative in March but more positive in April. Zero on the y-axis is the overall mean of the score; the dotted vertical line indicates the beginning of the war.

References

Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52 (3): 705–22. https://doi.org/10.1111/j.1540-5907.2008.00338.x.