dc.identifier.uri	http://hdl.handle.net/11401/77239
dc.description.sponsorship	This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.	en_US
dc.format	Monograph
dc.format.medium	Electronic Resource	en_US
dc.language.iso	en_US
dc.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type	Dissertation
dcterms.abstract	Language on the Internet and social media varies with time, geography, and social factors. For example, consider an online chat forum where people from different regions of the world interact. In such scenarios, it is important to track and detect regional variation in language. A person from the UK who is in conversation with someone from the USA could say "he is stuck in the lift" to mean "he is stuck in an elevator", since the word "lift" means "elevator" in the UK. Note that in the US, "lift" does not refer to an elevator. Modeling such variation allows applications to prompt or suggest the intended meaning to the other participants in the conversation. In this thesis, we conduct two related lines of inquiry focusing on (a) language itself and the variation it manifests and (b) the user and what we can infer about them based on their language use on social media. First, we develop computational methods to track and detect changes in word usage, including semantic and syntactic variation. We examine three modalities: time, geography, and domains. Specifically, we outline methods that use distributional word representations (word embeddings) to detect semantic variation in word usage. Our methods scale to large datasets, making them particularly suited to social media. Second, we turn our attention to users. In particular, we model latent traits of users based on their everyday language use on social media. We develop latent factor models that explicitly seek to build representations of each user based on their inferred latent traits. These models capture latent traits that serve as useful covariates for a wide variety of tasks, such as predicting which topics users like on social media and the number of friends in their social circle.
This work has broad applications in several fields, including information retrieval, semantic web applications, socio-variational linguistics, and computational social science, the latter covering digital health care and ad targeting.
dcterms.available	2017-09-20T16:52:15Z
dcterms.contributor	Balasubramanian, Niranjan	en_US
dcterms.contributor	Skiena, Steven	en_US
dcterms.contributor	Schwartz, H Andrew	en_US
dcterms.contributor	Bamman, David	en_US
dcterms.creator	Kulkarni, Vivek V.
dcterms.dateAccepted	2017-09-20T16:52:15Z
dcterms.dateSubmitted	2017-09-20T16:52:15Z
dcterms.description	Department of Computer Science	en_US
dcterms.extent	115 pg.	en_US
dcterms.format	Application/PDF	en_US
dcterms.format	Monograph
dcterms.identifier	http://hdl.handle.net/11401/77239
dcterms.issued	2017-05-01
dcterms.language	en_US
dcterms.provenance	Made available in DSpace on 2017-09-20T16:52:15Z (GMT). No. of bitstreams: 1 Kulkarni_grad.sunysb_0771E_13258.pdf: 9498397 bytes, checksum: 3133d4116308bf0ae7273283779272d4 (MD5) Previous issue date: 1	en
dcterms.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject	Computational Social Science, Language change, Variational Linguistics, Word Embeddings
dcterms.subject	Computer science
dcterms.title	Statistical Models for Linguistic Variation in Online Media
dcterms.type	Dissertation

Files in this item

Name: Kulkarni_grad.sunysb_0771E_132 ...
Size: 9.058Mb
Format: application/pdf

This item appears in the following Collection(s)

Stony Brook Theses and Dissertations Collection [4009]
