Herman Leung

RESEARCH & PROJECTS

Chinese Dependency Treebanks

During my post as Senior Research Assistant to Prof. John S.Y. Lee at the City University of Hong Kong (2016-2017), I contributed to the development of new versions of dependency treebank annotation for Mandarin and Cantonese under the Universal Dependencies (UD) framework, as well as parallel treebanks in Cantonese and Mandarin and a treebank of Chinese as a Foreign Language.

Cantonese UD guidelines [dependency relations] [parts of speech]

Chinese (Mandarin) UD guidelines [dependency relations] [parts of speech]

Cantonese treebank [github]

Chinese (Mandarin) treebank [github]

Chinese as a Foreign Language treebank [github]

I have co-authored 4 conference papers related to the above work (see resume).

Wiyot Language Database

[website]

I began this project in Fall 2014, in collaboration with Lynnika Butler at the Wiyot Tribe, to create an online corpus for language learning and revitalization efforts within the tribe, as well as for linguistics research.

As of 2016, the online database features a dictionary connected to a collection of sentences, texts, and audio files with advanced search functions.

Acronym recognition and disambiguation

[github link]

In July 2015, I participated in the CDIPS Data Science Workshop and spent three weeks on a collaborative project in acronym recognition and disambiguation under the mentorship of Hossein Falaki from Databricks, Inc.

We used data from a Wikipedia dump and employed cluster computing with Python and Apache Spark. Our overarching vision for the project was to create a text processing engine that can identify the meaning of an acronym in any text.

Over those three weeks, we wrote robust acronym extraction code, built an acronym dictionary, carried out exploratory analysis of acronym characteristics, and developed preliminary classification algorithms for acronym disambiguation.
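As a rough illustration of what acronym extraction can look like, here is a minimal single-machine sketch in plain Python. It is not the project's code and does not use Spark. The regular expressions, the 2-6 letter limit, the function names `extract_acronyms` and `extract_definitions`, and the initial-matching heuristic are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: an acronym candidate is a standalone run of 2-6 capital letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")
# Assumption: definitions appear as "Long Form (ACR)".
PAREN_RE = re.compile(r"\(([A-Z]{2,6})\)")


def extract_acronyms(text):
    """Count every acronym candidate that occurs in the text."""
    return Counter(ACRONYM_RE.findall(text))


def extract_definitions(text):
    """Map each parenthesized acronym to the expansions found just before it.

    An expansion is accepted only if the initials of the preceding words
    spell out the acronym (a deliberately simple heuristic).
    """
    definitions = {}
    for match in PAREN_RE.finditer(text):
        acronym = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            definitions.setdefault(acronym, set()).add(" ".join(candidate))
    return definitions


sample = ("The Universal Dependencies (UD) project defines labels. "
          "UD treebanks exist.")
print(extract_acronyms(sample))
print(extract_definitions(sample))
```

A heuristic like this misses acronyms that skip function words or take letters from inside a word, which is one reason a dictionary and a classification step are still needed for disambiguation.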

Yurok preverb ordering

This project examines the ordering facts of preverbal particles in Yurok. These particles number over 65 and have functions ranging from tense/aspect/modality to manner, associated motion, location, quantification, and negation. They can cluster in groups of up to four and can also be disjoined.

I have examined bigram patterns of preverb groups in order to determine the ordering facts, and investigated what principles account for them in Yurok and how they compare to known morphological ordering constraints, with the question of whether they can be considered non-affixal counterparts to verbal morphology.
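The bigram approach can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used in the project. The labels "A", "B", and "C" are hypothetical placeholders rather than actual Yurok preverbs, and the function names are invented for this example.

```python
from collections import Counter


def ordered_bigrams(clusters):
    """Count adjacent (first, second) pairs across all preverb clusters."""
    counts = Counter()
    for cluster in clusters:
        counts.update(zip(cluster, cluster[1:]))
    return counts


def fixed_order_pairs(counts):
    """Return the pairs attested in one order but never in the reverse."""
    return sorted((a, b) for (a, b) in counts if (b, a) not in counts)


# Hypothetical clusters; each inner list is one preverb group in surface order.
clusters = [["A", "B"], ["A", "B", "C"], ["B", "C"], ["C", "A"], ["A", "C"]]

counts = ordered_bigrams(clusters)
print(counts)
print(fixed_order_pairs(counts))
```

Pairs that occur in only one order are candidates for a strict ordering constraint, while pairs attested in both orders (here "A" and "C") point to freer ordering or to insufficient data.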

"Preverbal nonaffixal ordering in Yurok (Algic)." Presentation at QP Fest, UC Berkeley, 11/10/2014. [pdf]

Syntactic reanalysis of Cantonese coverbs

"Cantonese coverbs: A syntactic reanalysis." MA qualifying paper, UC Berkeley. Spring 2014. [pdf]

In this paper I reexamine previous proposals regarding the verbal nature of Cantonese coverbs in [V1 O1 V2 (O2)] constructions where V1 (the coverb) has preposition- and applicative-like functions. The paper adopts a model where [V1 O1] is an adjunct to the VP headed by V2, accounting for the inextractability of the coverb object as well as the ability of multiple coverb phrases to stack next to each other and to freely order with manner adverbs. Further investigation into [S V1 O1 V2 (O2)] constructions reveals that coverbs also have homonymous control verb functions, and that the two different functions of V1 participate in different syntactic processes.

"Syntactic reanalysis and the grammaticalization of Cantonese coverbs." Term paper for LING 230 Historical Linguistics with Andrew Garrett, Spring 2014. [pdf]

This paper follows up on the results of my first qualifying paper and explores a syntactically motivated polygrammaticalization theory, where the control verb and coverb functions of V1 in [S V1 O1 V2 O2] constructions emerged from an originally biclausal construction.

Poetic search engine

[github link]

The original goal of this collaborative course project was to create a match engine for lines of poetry: given an input line (supplied by a user or drawn at random from a poem), the engine would search a corpus of poetry and return a few of the best matches according to criteria such as syntax, semantics, etymology, and sound.

My contributions to the project consisted of four parts: (1) webscraping and creating an NLTK corpus of over 180,000 lines of poetry, (2) creating rhyming functions using the Carnegie Mellon University Pronouncing Dictionary (cmudict), (3) writing functions for matching etymological source between words using the Etymological Wordnet, and (4) exploring existing computational tools and data for determining semantic distance, including Wordnet and the Edinburgh Associative Thesaurus.
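To illustrate the rhyming component, here is a minimal sketch of one common way to test for rhyme with cmudict-style pronunciations: compare everything from the last primary-stressed vowel onward. It is not the project's code. The `SAMPLE` dictionary is a small hand-entered stand-in written in the ARPAbet notation that cmudict uses, so the example runs without downloading any corpus, and the function names are assumptions for this example.

```python
# Hand-entered sample in cmudict's ARPAbet style (the digit marks vowel stress).
SAMPLE = {
    "cat": ["K", "AE1", "T"],
    "hat": ["HH", "AE1", "T"],
    "moon": ["M", "UW1", "N"],
    "june": ["JH", "UW1", "N"],
    "dog": ["D", "AO1", "G"],
}


def rhyme_part(phones):
    """Return the phones from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones  # no primary stress marked: fall back to the whole word


def rhymes(word_a, word_b, pron):
    """True if two distinct words share the same rhyme part."""
    return (word_a != word_b
            and rhyme_part(pron[word_a]) == rhyme_part(pron[word_b]))


print(rhymes("cat", "hat", SAMPLE))
print(rhymes("moon", "june", SAMPLE))
print(rhymes("cat", "dog", SAMPLE))
```

With NLTK, `nltk.corpus.cmudict.dict()` maps each word to a list of pronunciations rather than a single one, so a fuller version would need to compare every pair of variants.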

Cantonese final particles

The unusually large inventory of Cantonese final particles (anywhere from over 30 to 90, depending on the analysis) has spawned numerous studies over the decades, starting with discourse and sociolinguistic approaches and later drawing more attention from lexico-semantic analyses. More recent work posits subsyllabic compositionality, reducing the 30-90 particles to a smaller set of segmental and tonal morphemes (at least 13 proposed by Sybesma and Li 2007).

I have been interested in further exploring and refining Sybesma and Li's proposal, particularly in seeing how a subsyllabic compositional theory may account for the large number of pragmatic uses of these particles. Additionally, I look into the final particles' interaction with the low boundary tone as well as their syntactic status, and also what implications the subsyllabic account has for clusters of final particles.

"Subsyllabic Semantics and Pragmatics in Cantonese Final Particles." Presentation at Syntax & Semantics Circle, UC Berkeley, 10/17/2013. [pdf]

Extrametricality in Misantla Totonac

"Nominal and Verbal Extrametricality in Misantla Totonac." Term paper for 211A Advanced Phonological Theory with Sharon Inkelas, 5/9/2013. [pdf]

This paper is a short investigation into the different ways nouns and verbs in Misantla Totonac deal with primary stress assignment and extrametricality at the right edge. I propose an Optimality Theory account that reflects a co-phonological situation where both nouns and verbs are constrained by a weight-to-stress principle that considers coronal obstruents to be non-moraic, but where nouns are sensitive to the entire rime (nucleus and coda) whereas verbs are sensitive only to the coda.

Ethnopoetic analysis of narrative speech in Cantonese

"'Who Caused This? Is This Fair?': A Multi-modal Ethnopoetic Analysis of Cantonese Oral Narrative and Rhetoric." MA Squib, San Francisco State University, Fall 2011. [pdf]

In this paper, I provide a multi-modal Hymesian ethnopoetic analysis of a 5.5-minute segment of polemical speech by a prominent Hong Kong grassroots activist. The monologue is analyzed in terms of its textual features (lexical, semantic, syntactic), prosody (pitch, intonation units, pause lengths), and gesture (specifically, head and eye movements). These features are found to be used jointly to structure the flow of the narratives as well as the arguments presented, contributing to recurring three- and five-part patterns that appear throughout the monologue. This paper supports a growing literature showing that multi-modal analyses incorporating aural and visual cues in addition to textual/grammatical features greatly enhance our understanding of what structural components of language performance are particularly salient or intended to be salient.

Herman Leung
leung.hm@gmail.com