Dariusz ZDONEK

Silesian University of Technology in Gliwice, Poland

Abstract

The author examines how reliably valid data can be harvested from the websites and social media sites of hospitals. The paper presents the problems encountered in automated web data extraction using Python, APIs, and web scraping, and the reliability of its results. The algorithm starts by collecting valid URLs from the names of hospitals and ends by retrieving hospitals’ news from their social media sites. The sample comprised 500 hospitals in Poland.

The automated online data harvesting method yielded results with a reliability of 81% to 94%, depending on the scope of analysis. Reliability depends on the correctness of the script, but some errors are independent of it: they can be caused by changed hospital names, security measures on API servers, and security measures on website hosting servers. The author suggests splitting automated online data harvesting into stages, reviewing and manually correcting the URLs of hospitals’ websites and social media sites, and repeating the scraping for missing data.
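
The suggestion to repeat scraping for missing data can be illustrated with a minimal Python sketch. It is not the author's script: the function names `harvest_with_repeats` and `share_retrieved`, the `fetch` callable, and the `max_rounds` parameter are all hypothetical, and the share of retrieved items computed here is a simple coverage figure, not necessarily the reliability measure reported in the paper.

```python
from typing import Callable, Dict, Iterable, Optional


def harvest_with_repeats(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Dict[str, Optional[str]]:
    """Run one harvesting stage, repeating it only for URLs still missing data.

    `fetch` is a hypothetical callable supplied by the caller (for example a
    wrapper around an HTTP client); it returns the harvested text or None.
    """
    results: Dict[str, Optional[str]] = {url: None for url in urls}
    for _ in range(max_rounds):
        # Only URLs without data are attempted again in later rounds.
        missing = [url for url, value in results.items() if value is None]
        if not missing:
            break
        for url in missing:
            try:
                results[url] = fetch(url)
            except Exception:
                # Failures outside the script's control (timeouts, blocked
                # requests) leave the entry empty for the next round.
                results[url] = None
    return results


def share_retrieved(results: Dict[str, Optional[str]]) -> float:
    """Fraction of URLs for which some data was retrieved (assumed metric)."""
    if not results:
        return 0.0
    found = sum(1 for value in results.values() if value is not None)
    return found / len(results)
```

Entries that are still `None` after the last round are natural candidates for the manual URL review suggested above, and keeping each stage as a separate call makes it possible to correct its output before the next stage runs.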

 

Keywords: web scraping, social media, Python, API.