About our project:

This is a continuation of the team project IKEA Meatballs (https://github.com/zccwqdoorchid/IKEA-meatballs/tree/main/Final_Project) by XinyanMO, nicksnlp, and zccwqdoorchid.

The project runs on Flask and requires the following libraries: scikit-learn, spaCy, beautifulsoup4, matplotlib, validators.

The following improvements have been made:

Wildcard search, bugs fixed.
Boolean search, bugs fixed.
Project structure: the functions have been moved into a class GallerySearch in gallery_search.py, which in turn calls web_scraping.py and data_visualization.py.
The main flask_app.py handles the HTML and passes the URL and the user's query to gallery_search.py. (This makes the project easier to run in web hosting environments and/or to incorporate into other applications.)
Scraped data is saved as a dictionary in an external file, scraped_data.json. If the file is not present, a new scrape is started (this may take around 5 minutes to complete).
"Scrape the WEB again" button on the loading page: removes the data and starts a new scraping process. If the process succeeds, the back-up data is also updated.
Old plot files are deleted on restart and/or after a new scrape is started.
Project is deployed on nicksnlp.pythonanywhere.com -- DISABLED.
Query preprocessing added: it removes unknown words, removes "and"/"or" at the beginning or the end of a query, and separates brackets with spaces.
The structure of the function self.search has been modified.
Lemmatisation improved in search modes Relevance + Wildcard.
Validators are applied during web scraping to ensure that the links are correctly formed.
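
As an illustration of the link check in the last item, here is a minimal sketch. The project itself uses the validators library; this stand-in uses only the standard library's urllib.parse so it runs without extra installs. The function name is_well_formed_link and the sample links are hypothetical, not taken from web_scraping.py.

```python
from urllib.parse import urlparse

def is_well_formed_link(link: str) -> bool:
    """Return True if link looks like an absolute http(s) URL."""
    parts = urlparse(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Hypothetical links, as they might come out of a scraped page
links = ["https://www.tate.org.uk/whats-on", "/whats-on", "htp:/broken"]

# Keep only links that are safe to request
valid_links = [link for link in links if is_well_formed_link(link)]
print(valid_links)
```

Relative or malformed links are dropped before any request is made, so a single bad href does not interrupt a scrape that takes several minutes.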

POSSIBLE FUTURE IMPROVEMENTS:

Improve Boolean search: add lemmatised search into "d*g and cats", "paints and cats".
Highlight words found in the text

Running the project:

The project can be run with the following commands:

Mac/Linux users:

Setting up the environment (assuming python3 is already installed):

python3 -m venv demoenv
. demoenv/bin/activate
pip install Flask
pip install -U spacy
python -m spacy download en_core_web_sm
pip install beautifulsoup4
pip install validators
pip install -U scikit-learn
pip install -U matplotlib

Clone the repository:

git clone git@github.com:nicksnlp/arthunt.git
cd arthunt

Run the Flask app:

export FLASK_APP=flask_app.py
export FLASK_DEBUG=True
export FLASK_RUN_PORT=8000
flask run

Then in your browser open: http://127.0.0.1:8000

A detailed description of the project:

This project is a search engine for ongoing and upcoming art exhibitions at the different branches of the Tate galleries. You can search for exhibition info with a query!

The search is based on data scraped from the tate.org.uk website. The data is saved in .json format, which speeds up the launch of the program. A new scrape can be started by pressing the "Scrape the WEB again" button on the search page (this happens automatically if the data is missing for some reason, e.g. because the process was interrupted in the previous session). Depending on whether the scraping process succeeds or not, a corresponding message is displayed under the search bar.
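
The load-or-scrape behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual code: load_or_scrape and fake_scrape are hypothetical names, the sample data is made up, and the real scraper lives in web_scraping.py.

```python
import json
import os
import tempfile

def load_or_scrape(path, scrape):
    """Return cached data from path, or call scrape() and cache its result."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or half-written file (e.g. an interrupted session): scrape again
        data = scrape()
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)
        return data

def fake_scrape():
    # Stand-in for the real scraper, which takes minutes to run
    return {"Example exhibition": {"location": "Tate Modern"}}

cache_path = os.path.join(tempfile.mkdtemp(), "scraped_data.json")
first = load_or_scrape(cache_path, fake_scrape)   # file missing: scrapes and writes
second = load_or_scrape(cache_path, lambda: {})   # file present: reads the cache
print(first == second)
```

Treating an unreadable file the same as a missing one covers the interrupted-session case mentioned above.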

Based on the search results, a bar chart is generated showing how many relevant exhibitions there are at each of Tate's branch galleries. For each exhibition in the search results, the following information is displayed:

the exhibition name
people names and other entities mentioned in the article (based on named entity recognition)
time period
location
a brief summary about the exhibition's content
a snapshot of an intro article
a "more info" button below each search result, which takes you to Tate's web page for that specific exhibition
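
The counts behind the bar chart described above could be computed along these lines. This is a sketch under assumptions: the result dictionaries, their "location" key, and the function name exhibitions_per_gallery are illustrative and not taken from data_visualization.py.

```python
from collections import Counter

# Made-up search results; only the "location" field matters here
results = [
    {"name": "Exhibition A", "location": "Tate Britain"},
    {"name": "Exhibition B", "location": "Tate Modern"},
    {"name": "Exhibition C", "location": "Tate Britain"},
]

def exhibitions_per_gallery(results):
    """Count how many search results fall at each gallery."""
    return Counter(r["location"] for r in results)

counts = exhibitions_per_gallery(results)

# Text rendering of the distribution, one row per gallery
for gallery, n in sorted(counts.items()):
    print(f"{gallery:<14}{'#' * n} {n}")
```

With matplotlib, the same counts could be drawn with plt.bar(list(counts.keys()), list(counts.values())).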

The search engine has 4 different search modes. The search mode is selected automatically based on the content of the query, and the mode activated for the query is displayed. The search modes are the following:

Relevance Search
Boolean Search (activated automatically if the query contains logic operator(s), including 'and', 'or', 'not', and brackets)

and combinations of those with Wildcard search:

Wildcard + Relevance Search (activated automatically if the query contains "*")
Wildcard + Boolean Search (activated automatically if the query contains "*" + logic operator)

In this version, lemmatisation is applied in the Relevance Search and Wildcard + Relevance Search modes. Exact search is performed on queries containing Boolean operators.
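
The automatic choice between the four modes listed above can be sketched as a small function. This is an illustrative guess at the logic, not the code in gallery_search.py; detect_mode and BOOLEAN_TOKENS are hypothetical names.

```python
import re

# Tokens that mark a query as Boolean, per the mode list above
BOOLEAN_TOKENS = {"and", "or", "not", "(", ")"}

def detect_mode(query: str) -> str:
    """Pick a search mode from the content of the query."""
    # Split into brackets and whitespace-separated words
    tokens = re.findall(r"[()]|[^\s()]+", query.lower())
    has_wildcard = "*" in query
    has_boolean = any(t in BOOLEAN_TOKENS for t in tokens)
    if has_wildcard and has_boolean:
        return "Wildcard + Boolean Search"
    if has_wildcard:
        return "Wildcard + Relevance Search"
    if has_boolean:
        return "Boolean Search"
    return "Relevance Search"

for q in ["watercolour", "paint*", "(cats or dogs)", "d*g and cats"]:
    print(q, "->", detect_mode(q))
```

Operators are matched as whole tokens, so a word such as "android" does not trigger Boolean mode.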

Also, in this updated version, preprocessing of queries has been added. Brackets are separated with spaces, and the search is activated even when receiving queries like "and cat". Unknown words are also removed from the query, including in queries with "*". The preprocessed query is displayed in the output.
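
A minimal sketch of such preprocessing is given here. It rests on assumptions: the function name preprocess, the vocabulary argument, and the choice to keep tokens containing "*" untouched are illustrative, and the project's own handling may differ.

```python
import re

OPERATORS = {"and", "or", "not"}
BRACKETS = {"(", ")"}

def preprocess(query, vocabulary):
    """Space out brackets, drop unknown words, trim leading/trailing and/or."""
    # Put spaces around every bracket so each becomes its own token
    tokens = re.sub(r"([()])", r" \1 ", query.lower()).split()
    # Keep operators, brackets, wildcard tokens (assumption), and known words
    kept = [
        t for t in tokens
        if t in OPERATORS or t in BRACKETS or "*" in t or t in vocabulary
    ]
    # Remove "and"/"or" left dangling at either end of the query
    while kept and kept[0] in {"and", "or"}:
        kept.pop(0)
    while kept and kept[-1] in {"and", "or"}:
        kept.pop()
    return " ".join(kept)

vocabulary = {"cat", "dog"}  # hypothetical set of words known to the index
print(preprocess("and cat", vocabulary))
print(preprocess("(cat or dog) and", vocabulary))
```

The sketch does not repair operators left dangling inside brackets after an unknown word is removed; it only trims the two ends of the query.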

Demo example:

If everything went well, the browser should display this home page:

Clicking the "start searching" button on the home page takes you to the search page:

And here is an example of search results displayed after inputting the query "Watercolour":

a bar chart that shows how many on-going/upcoming exhibitions related to "Watercolour" are at each of the branch galleries:
and the information about each related exhibition:
