Natural Language Processing in the Social Sector — Supplemental Materials

Term Frequencies (TF-IDF)

A basic question to ask about a collection of texts is: “What are some common keywords used throughout?” Tf-idf (term frequency-inverse document frequency) is a heuristic that tries to score how important a term is to a particular document within a set of documents. The basic intuition is that if a term occurs frequently in a particular text, but only occurs in a few of the texts in the collection, then it is probably more important to that specific text.

Applied to our dataset of state bills, we find that the highest raw term counts (e.g., “include”, “bill”, “follow”) go to terms that also occur in the most bills. These are fairly generic words. The highest tf-idf scores (largest bubbles) go to terms that occur far less frequently (e.g., “police”, “animal”, “poverty”) but tell us much more about the bills from which they come.

For example, if you hover over the word “homeless,” you will see it is the term with the highest tf-idf score for California bill AB-1733 Public records: fee waiver. The bill’s title and official summary — “An act to add Section 103577 to the Health and Safety Code, and to amend Section 14902 of the Vehicle Code, relating to public records” — do a poor job of communicating the purpose of the bill: to provide identification to homeless persons.

The tf-idf for a term t in a document d, within a document set, is computed as:

\[ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \]

and the idf is computed as:

\[ \mathrm{idf}(t) = \log{\frac{n}{\mathrm{df}(t)}} + 1 \]

where n is the total number of documents in the document set and df(t) is the document frequency of t, i.e., the number of documents in the set that contain the term t.

Sample code
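A minimal sketch of the formulas above in plain Python, using a raw count for tf. The tiny corpus below is invented for illustration; it is not drawn from the actual bills dataset.

```python
import math

def tf(term, doc):
    # Raw count of the term in the document (one common tf definition).
    return doc.count(term)

def idf(term, docs):
    # idf(t) = log(n / df(t)) + 1, matching the formula above.
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log(n / df) + 1

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical corpus of three tokenized "documents".
docs = [
    ["homeless", "identification", "fee", "waiver"],
    ["public", "records", "fee"],
    ["vehicle", "code", "records"],
]

# "homeless" appears in only one document, so it outscores "fee",
# which appears in two.
print(tfidf("homeless", docs[0], docs))  # log(3/1) + 1 ≈ 2.099
print(tfidf("fee", docs[0], docs))       # log(3/2) + 1 ≈ 1.405
```

This is the same idf convention used by scikit-learn's TfidfVectorizer when `smooth_idf=False`; in practice a library implementation with tokenization, smoothing, and normalization is preferable to rolling your own.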