An important reason people use microblog services is to seek and share information. Information is shared by posting tweets, which can contain not only text but also URLs. We refer to the URLs that appear in tweets as tweeted URLs.

Tweeted URLs matter because they are abundant and their content is generally high-quality, recent, and influential.

Acquiring information from tweeted URLs is a natural need for users and the basis for many applications. Tweeted URLs have attracted considerable attention from industry, yet very little research on them has been reported.

In this project, we will provide a systematic study of tweeted URLs. The project has two subtopics: the first is a statistical characterization of tweeted URLs, and the second is a model of their content. Our final aim is to improve the document representation of a tweeted URL by using information from its related context. Here, the related context refers to the tweets in which the URL appears and the users who have published it. The tweets provide a supplementary description of the tweeted URL, so they can help in understanding the document topic. Tweet publishers often describe themselves in a short profile or label themselves with tags such as "machine learning" or "NLP". These user tags can also indicate the document topic of a tweeted URL to some degree, because the tags often describe a user's interests and users tend to post URLs that match their interests. Note, however, that user tags are not equivalent to document tags in social bookmarking services, since they are not direct annotations of documents. It is not reasonable to assume that every user tag accurately explains the document topics, or that every document can be described by its related user tags.
Having established the utility of the tweeted URL context, we plan to design a topic model that captures the relationship among the tweets, the user tags, and the document content behind the tweeted URL. To make the learned topics easier to interpret, we define a one-to-one mapping between latent topics and user tags. In addition, based on our analysis of how user tags are used, we propose a soft constraint to express the effect of user tags on topic model estimation. Under the soft constraint, the document topics do not have to be confined to the topics of the related user tags, but those topics are emphasized.

The proposed model can be applied to recommendation tasks such as tag recommendation, tweet recommendation, and tweeted URL recommendation, and to document retrieval tasks such as tweeted URL search and tweet search. All of these tasks can improve the breadth and depth of information acquisition from microblog platforms.
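One way to picture the soft constraint is as an asymmetric Dirichlet prior over topics, with one topic per user tag. The sketch below is only an illustration under that assumption, not the model the project actually proposes: the function names, the `base` and `boost` values, and the toy counts are all hypothetical. Topics whose tag belongs to a publisher of the URL receive extra prior mass, while every other topic keeps a small non-zero mass and so remains possible.

```python
def soft_tag_prior(all_tags, related_tags, base=0.1, boost=1.0):
    """Asymmetric Dirichlet prior with one topic per user tag.

    Topics of related tags get base + boost; all other topics keep the
    small base mass, so they are de-emphasized but never ruled out.
    (base and boost are illustrative values, not taken from the project.)
    """
    related = set(related_tags)
    return [base + (boost if tag in related else 0.0) for tag in all_tags]


def expected_topic_mix(alpha, topic_counts):
    """Posterior mean of the document-topic distribution under the prior."""
    total = sum(alpha) + sum(topic_counts)
    return [(a + n) / total for a, n in zip(alpha, topic_counts)]


# Hypothetical tag vocabulary; the publisher is tagged with the first two.
tags = ["machine learning", "NLP", "sports", "music"]
alpha = soft_tag_prior(tags, related_tags=["machine learning", "NLP"])

# Hypothetical number of document words currently assigned to each topic.
counts = [3, 1, 2, 0]
theta = expected_topic_mix(alpha, counts)

for tag, weight in zip(tags, theta):
    print(f"{tag}: {weight:.3f}")
```

A hard constraint would correspond to setting the prior mass of unrelated topics to zero. Keeping `base` positive is what lets a document about sports still receive a "sports" topic even when its publisher carries no such tag.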
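An important reason users turn to microblogs is to gather and share information. A shared message can contain not only a text description but also an external reference (URL). We call the URLs that appear in microblog messages tweeted URLs. Their importance lies in (1) their large volume, (2) their timeliness, (3) their strong social influence, and (4) their high content quality. Effective information acquisition from tweeted URLs is a natural need for users and the basis for many applications. Although tweeted URLs have drawn wide attention from industry, academic research on them has only just begun.

This project intends to study tweeted URLs systematically. By comprehensively analyzing their statistical properties and building a content model for them, it uses tweet text and user tags to improve the understanding of web page content. The project plans to use a topic model to capture the relationship among messages, resources, and user tags, and to establish a direct link between topics and user tags. Based on the characteristics of user tags, it proposes a soft-constraint assumption for the effect of user tags on the topic model: document topics are required to be related to the topics of the associated tags, but are not fully restricted to them. The proposed model can be widely applied to recommendation and retrieval tasks.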
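In recent years, microblogging has grown rapidly as a new social networking application. Understanding the URLs in microblog messages, that is, tweeted URLs, and mining their value has become a research focus.

This project studies tweeted URLs with the goal of improving the understanding of Chinese tweeted URLs and the efficiency of information acquisition on microblog platforms. The main research topics are the analysis of Chinese tweeted URLs, the ranking of high-quality microblog users, and microblog user retrieval based on user tags. In addition, to improve the understanding of text content, we studied distributed representation learning methods for Chinese words and for sentiment words; to improve retrieval efficiency, we optimized query processing for the high-compression-ratio IPC encoding.

Specifically:
1) From statistics on tweeted URLs, we found that "art", "social", "news", and "game" content accounts for the largest share. Besides traditional news portals, the most frequently linked sites include many emerging social media sites (such as WeChat, Toutiao, and Bilibili), which reflects the content preferences of microblog users.
2) For ranking high-quality microblog users, we found that messages containing URLs are significantly more relevant to a user's topics than messages without URLs. We proposed a method that evaluates user quality using only URL-containing messages; it achieved the best ranking performance while reducing the input by 80% on average.
3) For user retrieval based on user tags, we explored a user matching method that uses the Wikipedia knowledge base, which effectively resolved the "term mismatch" between domain terms and user tags.
4) For distributed representation learning of Chinese words, we proposed MGE, a multi-granularity (word, character, radical) Chinese word representation learning method, which effectively improved performance on word similarity and analogical reasoning tasks. For sentiment words, we proposed using the external sentiment lexicon SentiWordNet to learn sentiment vectors for words, together with a sentiment analysis method that combines these sentiment vectors with ordinary word vectors.
5) For query processing over IPC encoding, we studied PDIPC, an online processing method for IPC encoding based on partial decompression, which effectively improved query processing speed.

The project published 7 papers and filed 3 patent applications. This research can serve as a reference for later work on social network resources and on Chinese word representation.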
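The word, character, and radical granularities mentioned for MGE can be illustrated with a toy composition. The sketch below is an assumption made for illustration only: the summary does not say how MGE combines the three levels, so the equal-weight average, the function names, and the tiny hand-written vectors and radical table are all hypothetical.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compose_word_vector(word, word_vecs, char_vecs, char_to_radical, radical_vecs):
    """Combine word-, character- and radical-level vectors for one word.

    Equal-weight averaging is an illustrative choice, not MGE's actual rule.
    """
    chars = list(word)
    char_part = average([char_vecs[c] for c in chars])
    radical_part = average([radical_vecs[char_to_radical[c]] for c in chars])
    return average([word_vecs[word], char_part, radical_part])


# Toy 2-dimensional embeddings (hypothetical values).
word_vecs = {"河流": [1.0, 0.0]}                      # "river"
char_vecs = {"河": [0.0, 1.0], "流": [0.0, 3.0]}
char_to_radical = {"河": "氵", "流": "氵"}             # both share the water radical
radical_vecs = {"氵": [2.0, 2.0]}

vec = compose_word_vector("河流", word_vecs, char_vecs, char_to_radical, radical_vecs)
print(vec)
```

The point of the finer granularities is that characters sharing a radical, as in this toy example, contribute a common component even when the whole word is rare.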
Data last updated: 2023-05-31