While waiting on IRB approval to begin collecting and analyzing subreddit data for my project, I’ve been thinking about the rhetorical dimensions of computational topic modeling. In preparing to undertake this project, I looked for examples of similar analyses in humanities-based rhetoric and composition journals but found that this hasn’t been a terribly popular method, although the communication side of rhetoric has more frequently taken up computational methods. Of course, given the humanities’ commitment to nuanced, reflexive, qualitative research, this apparent aversion to quantitative text analysis makes sense. After all, the humanities have already developed robust means of collecting, contextualizing, and coding human-centered data. Frequently, the humanities have provided needed critiques of computational or algorithmic approaches that fail to account for embodiment, for the accrued meaning generated during the collection and analysis of data, or for the unevenly distributed risks to participants whose information is collected in large datasets. That said, as the editors of a recent issue of Computers and Composition, “Composing Algorithms: Writing (with) Rhetorical Machines,” write in their introduction, algorithms provide an opportunity for composition, not just critique. Furthermore, in their chapter “Against Cleaning” in Debates in the Digital Humanities 2019, Katie Rawson and Trevor Muñoz write:
While humanities researchers do have discourses for limiting claims (acknowledging the choice of an archive or a particular intellectual tradition), the move into data-intensive research asks humanists to modify such discourses or develop new ones suitable for these projects. The ways in which the humanities engages these challenges may both open up new practices for other fields and allow humanities researchers who have made powerful critiques of the existing systems of data analysis to undertake data-intensive forms of research in ways that do not require them to abandon their commitments to such critiques.
In keeping with these invitations to engage data-centric research through the critiques already generated, I’m approaching this project through a digital cultural rhetorics lens. My hope is to create a final product (or really, an experience) that will represent the layers of decisions I make as a researcher and the ways these decisions create and alter relationships among elements of my project, thus shaping its overall meaning.
As part of this process, I’ve been reading tutorials about how to conduct computational topic modeling to use as examples as I create my own workflow, then adjusting those examples to maintain the “messiness” of rhetorical research. My primary concern in applying computational methods to a rhetorical problem (namely, what topics does this subreddit discuss, and how do those topics change over time or by the object of snark?) is to avoid overdetermining the results through the choices I make in setting up the algorithmic analysis. In a world with endless time and resources, I might read through every post and comment in the timeframe and code them inductively, then check my codes against those generated by other readers. However, this would take far more time than is available within a PhD timeline; hence my interest in computational topic modeling. In most text-analysis workflows, collected text is “cleaned” by removing stop words, like articles and prepositions, and by stemming or lemmatizing the remaining words, reducing them to their roots. (Stemming and lemmatizing are similar methods, but lemmatizing is often favored because it is more strongly contextual and less likely to return nonsense words than stemming.) While I’ll certainly remove stop words from my Reddit corpus, I’ve decided to forgo lemmatization in order to preserve the wordplay that forms the basis of this snark community. The results will likely be somewhat less tidy than if I were to lemmatize the corpus, but I’m leaning into that possibility.
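To make that cleaning decision concrete, here is a minimal sketch of the kind of preprocessing step I have in mind: stop-word removal without stemming or lemmatization. The tiny stop-word list and the sample sentence are my own illustrations, not drawn from the actual corpus; a real workflow would use a fuller list, such as NLTK’s or spaCy’s English stop words.

```python
import re

# A deliberately small stop-word list for illustration only; swap in a
# standard list (e.g. NLTK's English stop words) for real analysis.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on",
    "is", "it", "for", "with", "that", "this",
}

def clean(text):
    """Lowercase, tokenize on letters/apostrophes, and drop stop words.
    Stemming/lemmatization is intentionally skipped so that wordplay
    (puns, intentional misspellings) survives into the topic model."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("The snark in this subreddit is an art form"))
# → ['snark', 'subreddit', 'art', 'form']
```

Keeping tokens unstemmed means “snarky,” “snarking,” and “snark” stay distinct, which is exactly the texture I want the model to see.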
Following Nicole Brown’s novel method of Computational Digital Autoethnography (CDA), I plan to use narrative and reflection to augment the results of my topic analysis. In her introductory CDA study, Brown harvests a corpus of her own social media data consisting of all written Facebook text (posts, comments, captions) from 2007–2016. She then applies latent Dirichlet allocation (LDA) topic modeling to surface topics in the corpus. Because LDA treats documents as bags of words, discarding word order, she runs the analysis with multiple topic numbers (the variable K) to track shifts in topic distributions. If LDA had been used as the sole or primary method in this study, certain gaps would have emerged in the data: some topics could be misinterpreted without careful attention and detailed knowledge of the corpus, and some topics could appear incoherent (and would normally be discarded). Brown fills these gaps with autoethnographic reflection, using the “failures” of LDA to reflect on data autonomy, representation of the self through text, and the value of incoherent topics or themes, which point to experiences that may not be fully captured by quantitative methods. After demonstrating the use of CDA, Brown concludes by observing that while CDA can be used to decolonize computational methods by turning them toward identity- and justice-based work, it can also colonize autoethnographic methods, if care is not taken, by delegitimizing or sidelining the narrative aspects of the work. While my project is not autoethnographic, I appreciate the texture of Brown’s method, which I think could be fruitful in a digital cultural rhetorics project such as the one I undertake in this fellowship.