---
title: Link Lazarus Method
emoji: πŸ”—
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: "1.32.2"
app_file: wikipedia_dead_links_streamlit.py
pinned: false
---



# The Link Lazarus Method: Wikipedia Dead Link Finder by metehan.ai - Streamlit Version

A Streamlit web application for finding and logging dead (broken) external links in Wikipedia articles, identifying potentially available domains for registration, and saving them to a dedicated database.

## Features

- **Multiple Search Methods**:
  - Search by text to find Wikipedia articles
  - Search by category to find related articles
- **Dead Link Detection**: Checks external links for HTTP errors or connection issues
- **Domain Availability**: Identifies which domains from dead links might be available for registration
- **Restricted TLD Filtering**: Automatically identifies and excludes restricted domains (.edu, .gov, etc.)
- **Available Domains Database**: Maintains a separate database of potentially available domains
- **Real-time Logging**: Saves dead links and available domains to JSON files as they're found
- **Result Visualization**: Displays results in an interactive table with filtering options
- **Export to CSV**: Download results as a CSV file
- **Web Archive Filter**: Automatically ignores links from web.archive.org
- **Configurable**: Adjust settings via the sidebar
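
As an illustration of the dead-link and Web Archive checks described above, here is a minimal sketch (assumptions: `requests` is used and a link counts as dead on an HTTP 4xx/5xx status or any connection failure; the app's actual logic may differ):

```python
# Minimal sketch of a dead-link check; not the app's exact implementation.
from urllib.parse import urlparse

import requests


def is_dead_link(url, timeout=10.0):
    """Return True if the link looks dead (HTTP error or connection failure)."""
    host = urlparse(url).netloc.lower()
    if host.endswith("web.archive.org"):
        return False  # archived snapshots are ignored (Web Archive Filter)
    try:
        # HEAD is cheap; fall back to GET because some servers reject HEAD.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True  # timeouts, DNS failures, refused connections, etc.
```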

## Requirements

- Python 3.6+
- Required packages listed in `requirements_streamlit.txt`

## Installation

```
pip install -r requirements_streamlit.txt
```

## Usage

Run the Streamlit app:

```
streamlit run wikipedia_dead_links_streamlit.py
```

The application will open in your default web browser with three main tabs:

### 1. Search by Text

- Enter search terms to find Wikipedia articles containing that text
- View search results with snippets
- Process all found pages to check for dead links and available domains
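
Article search of this kind is typically backed by the MediaWiki search API; the sketch below shows one way to fetch matching titles (an illustration of the idea, not necessarily the app's exact query):

```python
# Full-text search against the standard MediaWiki API.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def search_articles(query, limit=10):
    """Return titles of Wikipedia articles matching the search text."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    return [hit["title"] for hit in data["query"]["search"]]
```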

### 2. Search by Category

- Enter a category name to find Wikipedia categories
- Select a category to crawl its pages
- Find dead links and available domains within those pages
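
Category crawling can be done with the same API via `list=categorymembers`; a short sketch (again an illustration, not the app's exact code):

```python
# Fetch the pages that belong to a Wikipedia category.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def category_pages(category, limit=50):
    """Return page titles in a category, e.g. category='Defunct websites'."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "cmtype": "page",
        "format": "json",
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    return [member["title"] for member in data["query"]["categorymembers"]]
```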

### 3. Available Domains

- View all potentially available domains found during searches
- Filter domains by status (potentially available, expired, etc.)
- See details about each domain including where it was found
- Download the list as a CSV file

## How Domain Availability Works

The app uses these methods to determine if a domain might be available:

1. **WHOIS Lookup**: Checks if the domain has registration information
2. **Expiration Check**: Identifies domains with expired registration dates
3. **DNS Lookup**: Verifies if the domain has active DNS records
4. **TLD Restriction Check**: Identifies restricted TLDs that cannot be freely registered

Domains are flagged as potentially available if:
- No WHOIS registration data is found
- The domain's expiration date has passed
- No DNS records exist for the domain
- The domain does NOT have a restricted TLD (.edu, .gov, .mil, etc.)
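
A minimal sketch of these heuristics, assuming the `python-whois` package and the standard-library `socket` module (the app's real checks may be more elaborate):

```python
# Heuristic availability check; a registrar remains the final authority.
import datetime
import socket

import whois  # pip install python-whois

RESTRICTED_TLDS = (".edu", ".gov", ".mil", ".int", ".ac.uk", ".gov.uk")


def might_be_available(domain):
    """Return True if the domain looks potentially available."""
    if domain.lower().endswith(RESTRICTED_TLDS):
        return False  # restricted TLDs are never reported as available

    # DNS lookup: an active A record suggests the domain is in use.
    try:
        socket.gethostbyname(domain)
        has_dns = True
    except socket.gaierror:
        has_dns = False

    # WHOIS lookup and expiration check.
    try:
        record = whois.whois(domain)
        registered = bool(record.domain_name)
        expires = record.expiration_date
        if isinstance(expires, list):  # some registries return several dates
            expires = expires[0]
        expired = bool(expires) and expires < datetime.datetime.now()
    except Exception:
        registered, expired = False, False  # treat lookup failure as "no WHOIS data"

    return (not registered) or expired or (not has_dns)
```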

### Restricted TLDs (Optional)

The following TLDs are recognized as restricted and, when this filter is enabled, are never reported as available:
- .edu - Educational institutions
- .gov - Government entities
- .mil - Military organizations
- .int - International organizations
- Country-specific restrictions like .ac.uk, .gov.uk, etc.

**Note**: For definitive availability, you should verify with a domain registrar. The tool provides a starting point for identifying potential opportunities.

## Configuration Options

- **Log file path**: Where to save the dead links JSON results
- **Available domains file**: Where to save the available domains database
- **Max concurrent requests**: Number of links to check simultaneously
- **Max pages to process**: Limit the number of articles to process
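
These options map naturally onto Streamlit sidebar widgets; a hypothetical sketch of the wiring (labels and defaults here are illustrative, not the app's actual identifiers):

```python
# Hypothetical sidebar configuration; the real app may use different labels/defaults.
import streamlit as st

log_file = st.sidebar.text_input("Log file path", "wikipedia_dead_links.json")
domains_file = st.sidebar.text_input("Available domains file", "available_domains.json")
max_workers = int(st.sidebar.number_input("Max concurrent requests", min_value=1,
                                          max_value=50, value=10))
max_pages = int(st.sidebar.number_input("Max pages to process", min_value=1,
                                        max_value=500, value=25))
```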

## Output Files

The app generates two main JSON files:

1. **wikipedia_dead_links.json**: Contains details about all dead links found
2. **available_domains.json**: Contains only the potentially available domains and where they were found
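
To inspect these JSON files outside the app, a quick conversion to CSV works (the exact record fields depend on what the app writes, so none are assumed here):

```python
# Convert a JSON results file to CSV for inspection outside the app.
import json

import pandas as pd

with open("wikipedia_dead_links.json", encoding="utf-8") as fh:
    records = json.load(fh)  # expected to be a list of per-link records

pd.DataFrame(records).to_csv("wikipedia_dead_links.csv", index=False)
```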

You can also download results as CSV files directly from the app. Make sure to follow @metehan777 on X and connect on LinkedIn at www.linkedin.com/in/metehanyesilyurt for upcoming updates and more tips & tools.