LLM Tokenization

All languages are NOT created (tokenized) equal: Language models cost much more in some languages than others

NLP, multilingual

Book Bans and Censorship in the United States

A wide range of stories is currently being banned in the US: stories of LGBTQ+ communities, Muslim families, and women in science. LLMs such as GPT-3 refused to recommend banning books outright, for any age level.

NLP, data analysis

DALLE Red-Teaming

Served on the team of AI researchers that probed ('red-teamed') OpenAI's DALLE-2 before its public release to detect potential harms, biases, and disinformation.

AI Art

Scaling Radio Analysis with Data Science for Infodemic Monitoring

Evaluated and analyzed COVID-19 vaccine discourse on public radio transcriptions for public health monitoring. Master's Thesis with the Oxford Internet Institute and United Nations Global Pulse.

NLP

Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models

Probed GPT-2 with prefix templates related to gender and occupation to evaluate biases in its predictions, which were compared with ground-truth US labor data.

NLP, bias in AI
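
As a rough illustration of prefix-template probing, here is a minimal sketch. The template wording, the gender terms, and the helper names `build_prompts` and `occupation_shares` are assumptions made for this example and are not taken from the project; the step that sends each prompt to GPT-2 is left out.

```python
from collections import Counter

# Hypothetical template and word list, chosen for illustration only.
TEMPLATE = "The {gender} works as a"
GENDERS = ("man", "woman")


def build_prompts(template=TEMPLATE, genders=GENDERS):
    """Fill the template once per gender term, keyed by that term."""
    return {g: template.format(gender=g) for g in genders}


def occupation_shares(completions):
    """Return the fraction of completions that begin with each word.

    Each completion is the text a model produced after the prompt;
    its first word is treated as the predicted occupation.
    """
    first_words = [
        c.split()[0].lower().strip(".,") for c in completions if c.strip()
    ]
    counts = Counter(first_words)
    total = sum(counts.values())
    return {job: n / total for job, n in counts.items()}
```

The shares computed per gender term could then be set side by side with the corresponding shares in labor statistics.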

Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

Evaluated Facebook's Hateful Memes Challenge by comparing Facebook's carefully synthesized dataset with a collection of 'memes in the wild' gathered from Pinterest.

NLP, multimodal

Covid Texting Service

Created a texting service that answers questions about the pandemic and provides COVID-19 statistics to people without Internet access. This is a working project with Silicon Harlem in NYC to get the service into the hands of people in need.

Demo: Text 1NYC to 313131

twilio, full-stack development

Data Surveillance and Biocitizenship in the COVID-19 Pandemic: Digital Contact-tracing in South Korea, Hong Kong, Singapore, and Taiwan

Analyzed the privacy implications of digital contact tracing during the early days of the COVID-19 pandemic through topic modeling and semantic network analysis of news media from 6 countries and 3 languages.

topic modeling, text cleaning, networks

Big Data as Historical Archive: The Challenges of Preserving Today’s Digital Artifacts

Examined the greatest challenges in the long-term preservation of big data, which differ from those of preserving the mostly static, smaller-scale digital material that concerned archivists in the past. With Seoul National University Big Data Studies Lab.

big data studies, digital history, digital humanities

Joseon Munkwa Project

Conducted named-entity recognition and disambiguation on historical figures from Korean Joseon-Dynasty civil service roster data. With Seoul National University Big Data Studies Lab.

digital history, digital humanities, NER, disambiguation

Virtual Coffeeshop

Created a virtual coffeeshop experience for those of us stuck at home under stay-at-home orders and missing the vibe and camaraderie of a cafe.

full-stack development, react

Music Factorization

Factorized a scale into parts that can be understood using combinations of 'symmetric' scales. An exercise in breaking down all scales into a combination of whole tone scales.

Python, music theory
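
One way to picture the whole-tone breakdown is the sketch below. It assumes that "factorizing" means splitting a scale's pitch classes between the two whole-tone scales; that reading, and the names `split_by_whole_tone` and `C_MAJOR`, are assumptions for illustration, not the project's actual code.

```python
# Pitch classes are integers 0-11 (0 = C). The two whole-tone scales
# partition all twelve pitch classes between them.
WHOLE_TONE_EVEN = frozenset(range(0, 12, 2))  # C, D, E, F#, G#, A#
WHOLE_TONE_ODD = frozenset(range(1, 12, 2))   # C#, D#, F, G, A, B


def split_by_whole_tone(scale):
    """Split a scale into the notes it shares with each whole-tone scale."""
    pitch_classes = {note % 12 for note in scale}
    return (
        sorted(pitch_classes & WHOLE_TONE_EVEN),
        sorted(pitch_classes & WHOLE_TONE_ODD),
    )


C_MAJOR = [0, 2, 4, 5, 7, 9, 11]
even_part, odd_part = split_by_whole_tone(C_MAJOR)
print(even_part, odd_part)  # [0, 2, 4] [5, 7, 9, 11]
```

Under this reading, C major is C, D, E from one whole-tone scale plus F, G, A, B from the other.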