Monday, August 1, 2016

[KATALIST] Issue 33 of the Code4Lib Journal has been published (with Hungarian authors!)

Dear Katalist members,

Issue 33 of the Code4Lib Journal has been published:

http://journal.code4lib.org/issues/issue33

A special point of interest: one of the articles is an extended,
English-language version of Zsolt Bánki's excellent Networkshop
presentation on merging the name databases of the Petőfi Literary
Museum.

Two articles in this issue deal with data mining. Monica Maceli has
written, specifically for beginners, about the R environment and
programming language; I recommend it to everyone who would like a
taste of text mining and statistics. Corey A. Harper has written a
very important and thorough study on the quality assessment of the
Digital Public Library of America's metadata, and on whether there is
a demonstrable correlation between certain quality criteria and how
often the records are used. I would especially highlight the witty
"metadata fingerprint" visualization with which the author
characterizes individual records or groups of records. The
Biodiversity Heritage Library has for years been one of the world's
most innovative and most industrious digitization projects (if my
memory serves, they also collaborated with the Hungarian Electronic
Library a few years ago); this time they report on the lessons of a
failure: why did the community-based character recognition built into
games not succeed? It is an important article for those who use OCR
software as well as for anyone designing gamified services and
crowdsourcing.

The full table of contents:

Editorial Introduction – Summer Reading List
by Ron Peterson
http://journal.code4lib.org/articles/11859
New additions for your summer reading list!

Emflix – Gone Baby Gone
by Netanel Ganin
http://journal.code4lib.org/articles/11762
Enthusiasm is no replacement for experience. This article describes a
tool developed at the Emerson College Library by an eager but
overzealous cataloger. Attempting to enhance media-discovery in a
familiar and intuitive way, he created a browseable and searchable
Netflix-style interface. Though it may have been an interesting idea,
many of the crucial steps that are involved in this kind of
high-concept work were neglected. This article will explore and
explain why the tool ultimately has not been maintained or updated,
and what should have been done differently to ensure its legacy and
continued use.

Introduction to Text Mining with R for Information Professionals
by Monica Maceli
http://journal.code4lib.org/articles/11626
The 'tm: Text Mining Package' in the open source statistical software
R has made text analysis techniques easily accessible to both novice
and expert practitioners, providing useful ways of analyzing and
understanding large, unstructured datasets. Such an approach can yield
many benefits to information professionals, particularly those
involved in text-heavy research projects. This article will discuss
the functionality and possibilities of text mining, as well as the
basic setup necessary for novice R users to employ the RStudio
integrated development environment (IDE). Common use cases, such as
analyzing a corpus of text documents or spreadsheet text data, will be
covered, as well as the text mining tools for calculating term
frequency, term correlations, clustering, creating wordclouds, and
plotting.
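
(If you would like to try the idea before setting up R and the tm
package, the sketch below shows the same kind of term-frequency
counting in Python; the toy corpus and stopword list are my own
invention, not from the article.)

# Minimal term-frequency sketch in Python, analogous to the tm-based
# analysis the article walks through; corpus and stopwords are toy data.
import re
from collections import Counter

corpus = [
    "Metadata quality varies widely across aggregated library collections.",
    "Text mining helps librarians understand large unstructured collections.",
    "Librarians use term frequency to summarize a corpus of documents.",
]

stopwords = {"a", "of", "to", "the", "use", "across", "helps"}

def tokenize(text):
    # lowercase, keep alphabetic tokens, drop stopwords
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

term_freq = Counter()
for doc in corpus:
    term_freq.update(tokenize(doc))

for term, freq in term_freq.most_common(5):
    print(term, freq)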

Data for Decision Making: Tracking Your Library's Needs With TrackRef
by Michael Carlozzi
http://journal.code4lib.org/articles/11740
Library services must adapt to changing patron needs. These
adaptations should be data-driven. This paper reports on the use of
TrackRef, an open source and free web program for managing reference
statistics.

Are games a viable solution to crowdsourcing improvements to faulty
OCR? – The Purposeful Gaming and BHL experience
by Max J. Seidman, Dr. Mary Flanagan, Trish Rose-Sandler, Mike Lichtenberg
http://journal.code4lib.org/articles/11781
The Missouri Botanical Garden and partners from Dartmouth, Harvard,
the New York Botanical Garden, and Cornell recently wrapped up a
project funded by IMLS called Purposeful Gaming and BHL: engaging the
public in improving and enhancing access to digital texts
(http://biodivlib.wikispaces.com/Purposeful+Gaming). The goals of the
project were to significantly improve access to digital texts through
the applicability of purposeful gaming for the completion of data
enhancement tasks needed for content found within the Biodiversity
Heritage Library (BHL). This article will share our approach in terms
of game design choices and the use of algorithms for verifying the
quality of inputs from players as well as challenges related to
transcriptions and marketing. We will conclude by giving an answer to
the question of whether games are a successful tool for analyzing and
improving digital outputs from OCR and whether we recommend their
uptake by libraries and other cultural heritage institutions.
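
(The abstract mentions algorithms for verifying the quality of player
input; the paper has the details, but a common approach, sketched
below in Python purely for illustration, is to accept a transcription
only once enough independent players agree on it. This is not the
project's actual algorithm.)

# Hypothetical sketch of verifying crowdsourced OCR corrections:
# accept a word only when independent players agree above a threshold.
from collections import Counter

def verified_transcription(player_inputs, min_votes=3, min_agreement=0.66):
    # player_inputs: strings typed by different players for the same word
    if len(player_inputs) < min_votes:
        return None  # not enough independent inputs yet
    normalized = [s.strip().lower() for s in player_inputs]
    best, count = Counter(normalized).most_common(1)[0]
    return best if count / len(normalized) >= min_agreement else None

print(verified_transcription(["Quercus", "quercus", "Ouercus", "quercus"]))  # quercus
print(verified_transcription(["alba", "alba"]))  # None (too few votes so far)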

From Digital Commons to OCLC: A Tailored Approach for Harvesting and
Transforming ETD Metadata into High-Quality Records
by Marielle Veve
http://journal.code4lib.org/articles/11676
The library literature contains many examples of automated and
semi-automated approaches to harvest electronic theses and
dissertations (ETD) metadata from institutional repositories (IR) to
the Online Computer Library Center (OCLC). However, most of these
approaches could not be implemented with the institutional repository
software Digital Commons because of various reasons including
proprietary schema incompatibilities and high level programming
expertise requirements our institution did not want to pursue. Only
one semi-automated approach was found in the library literature which
met our requirements for implementation, and even though it catered to
the particular needs of the DSpace IR, it could be adapted to
other IR software if further customizations were applied.
The following paper presents an extension of this semi-automated
approach originally created by Deng and Reese, but customized and
adapted to address the particular needs of the Digital Commons
community and updated to integrate the latest Resource Description &
Access (RDA) content standards for ETDs. Advantages and disadvantages
of this workflow are discussed and presented as well.
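
(For readers unfamiliar with this kind of workflow: the harvesting
step is typically done over OAI-PMH. The Python sketch below is only
a generic illustration of that step; the endpoint, set name, and
field mapping are placeholders, not the author's actual Digital
Commons configuration.)

# Generic OAI-PMH harvest-and-transform illustration; a tiny inline
# response stands in for a real feed such as
#   https://repository.example.edu/do/oai/?verb=ListRecords&metadataPrefix=oai_dc
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

sample_response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>A Sample Electronic Thesis</dc:title>
     <dc:creator>Doe, Jane</dc:creator>
     <dc:date>2016-05-01</dc:date>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>
"""

root = ET.fromstring(sample_response)
for record in root.iter(OAI + "record"):
    meta = record.find(OAI + "metadata")
    if meta is None or len(meta) == 0:
        continue  # deleted or empty record
    dc = meta[0]  # the oai_dc:dc container
    etd = {
        "title":   [e.text for e in dc.findall(DC + "title")],
        "creator": [e.text for e in dc.findall(DC + "creator")],
        "date":    [e.text for e in dc.findall(DC + "date")],
    }
    print(etd)  # the next step would map this into a MARC/RDA record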

Checking the identity of entities by machine algorithms: the next step
to the Hungarian National Namespace
by Zsolt Bánki, Tibor Mészáros, Márton Németh, András Simon
http://journal.code4lib.org/articles/11765
The redundancy of entities coming from different sources caused
problems during the building of the personal name authorities for the
Petőfi Museum of Literature. It was a top priority to cleanse and
unite classificatory records which have different data content but
pertain to the same person without losing any data. As a first step in
2013, we found identities in approximately 80,000 name records so we
merged the data content of these records. In the second phase a much
more complicated algorithm had to be applied to show these identities.
We cleansed the database by uniting approximately 36,000 records. The
workflow for automatic detection of authority data tries to follow
human intelligence. The database scripts normalize and examine about
20 kinds of data elements according to information about dates,
localities, occupation and name variations. The result of creating
pairs from the database authority records, as potential redundant
elements, was a graph, which was condensed to a tree, by human efforts
of the curators of the museum. With this, the limit of technological
identification was reached. For further data cleansing human
intelligence that can be assisted by computerized regular monitoring
is needed, based upon the developed algorithm. As a result, the
service containing about 620,000 authority name records will be an
indispensable foundation to the establishment of the National Name
Authorities. This article shows the work process of unification.
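
(The museum's scripts normalize and compare about 20 kinds of data
elements; the Python sketch below is a heavily simplified
illustration of the candidate-pairing idea only, with invented field
names and a much cruder matching rule than the actual workflow.)

# Heavily simplified, hypothetical sketch of candidate-pair generation
# for authority deduplication; fields and rules are invented examples.
import unicodedata
from itertools import combinations

def normalize_name(name):
    # strip accents, lowercase, and neutralize "Family, Given" ordering
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    parts = [p.strip().lower() for p in ascii_name.split(",")]
    return " ".join(sorted(parts))

records = [
    {"id": 1, "name": "Petőfi, Sándor", "birth": "1823"},
    {"id": 2, "name": "Petofi Sandor",  "birth": "1823"},
    {"id": 3, "name": "Arany, János",   "birth": "1817"},
]

candidate_pairs = [
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if normalize_name(a["name"]) == normalize_name(b["name"])
    and a["birth"] == b["birth"]
]
print(candidate_pairs)  # [(1, 2)] -> records to review for merging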

Metadata Analytics, Visualization, and Optimization: Experiments in
statistical analysis of the Digital Public Library of America (DPLA)
by Corey A. Harper
http://journal.code4lib.org/articles/11752
This paper presents the concepts of metadata assessment and
"quantification" and describes preliminary research results applying
these concepts to metadata from the Digital Public Library of America
(DPLA). The introductory sections provide a technical outline of data
pre-processing, and propose visualization techniques that can help us
understand metadata characteristics in a given context. Example
visualizations are shown and discussed, leading up to the use of
"metadata fingerprints" — D3 Star Plots — to summarize metadata
characteristics across multiple fields for arbitrary groupings of
resources. Fingerprints are shown comparing metadata characteristics
for different DPLA "Hubs" and also for used versus not used resources
based on Google Analytics "pageview" counts. The closing sections
introduce the concept of metadata optimization and explore the use of
machine learning techniques to optimize metadata in the context of
large-scale metadata aggregators like DPLA. Various statistical models
are used to predict whether a particular DPLA item is used based only
on its metadata. The article concludes with a discussion of the broad
potential for machine learning and data science in libraries, academic
institutions, and cultural heritage.
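
(The prediction experiments can be imagined roughly as in the Python
sketch below: a simple classifier trained on per-record metadata
features. The features and toy data here are invented, not the DPLA
fields or the models actually used in the paper.)

# Toy sketch of predicting item usage from metadata features with
# logistic regression; features and labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-record features: [description length in words,
# number of subject headings, has a thumbnail (0/1)]
X = np.array([
    [120, 5, 1],
    [  3, 0, 0],
    [ 80, 3, 1],
    [  0, 1, 0],
    [ 45, 2, 1],
    [ 10, 0, 0],
])
# 1 = record received pageviews, 0 = never viewed (toy labels)
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict([[60, 4, 1]]))        # predicted usage for a new record
print(model.predict_proba([[60, 4, 1]]))  # and the associated probabilities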

--
Király Péter
editor
http://linkedin.com/in/peterkiraly

_______________________________________________
Katalist mailing list
Katalist@listserv.niif.hu
https://listserv.niif.hu/mailman/listinfo/katalist